Abstract

Many emerging application domains require database systems to support efficient access over highly multidimensional
datasets. The current state-of-the-art technique for indexing high dimensional data is to first
reduce the dimensionality of the data using Principal Component Analysis and then index the
reduced-dimensionality space using a multidimensional index structure. The above technique, referred to as global
dimensionality  reduction  (GDR),  works  well  when  the  data  set  is  globally  correlated,  i.e.  most  of  the  variation 
in  the  data  can  be  captured  by  a  few  dimensions.  In  practice,  datasets  are  often  not  globally  correlated.  In  such 
cases,  reducing  the  data  dimensionality  using  GDR  causes  significant  loss  of  distance  information  resulting  in 
a  large  number  of  false  positives  and  hence  a  high  query  cost.  Even  when  a  global  correlation  does  not  exist, 
there  may  exist  subsets  of  data  that  are  locally  correlated.  In  this  paper,  we  propose  a  technique  called  Local 
Dimensionality Reduction (LDR) that tries to find local correlations in the data and performs dimensionality
reduction  on  the  locally  correlated  clusters  of  data  individually.  We  develop  an  index  structure  that  exploits 
the  correlated  clusters  to  efficiently  support  point,  range  and  k-nearest  neighbor  queries  over  high  dimensional 
datasets.  Our  experiments  on  synthetic  as  well  as  real-life  datasets  show  that  our  technique  (1)  reduces  the 
dimensionality  of  the  data  with  significantly  lower  loss  in  distance  information  compared  to  GDR  and  (2) 
significantly  outperforms  the  GDR,  original  space  indexing  and  linear  scan  techniques  in  terms  of  the  query 
cost  for  both  synthetic  and  real-life  datasets. 


1  Introduction 

With an increasing number of new database applications dealing with highly multidimensional datasets,
techniques to support efficient query processing over such datasets have emerged as an important research area.
These applications include multimedia content-based retrieval, exploratory data analysis/data mining, scientific
databases, medical applications and time-series matching. For example, in multimedia retrieval, the objects (e.g.,
images) are represented by their features (e.g., color histograms, texture vectors and shape descriptors) which
define high dimensional feature spaces (HDFS) [17, 37]. In data mining applications, objects are represented
by several numeric attributes which again define an HDFS over which the data mining task (e.g., clustering,
classification) is performed [3, 33]. HDFSs are also becoming increasingly common in scientific (e.g., SDSS's
astronomy database [41]) and medical databases [30]. To provide efficient access over HDFSs, many indexing
techniques have been proposed in the literature. One class of techniques comprises high dimensional index

* This work was supported by NSF CAREER award IIS-9734300, and in part by the Army Research Laboratory under Cooperative
Agreement No. DAAL01-96-2-0003.




Report Documentation Page (Standard Form 298, Rev. 8-98)

Report date: 2000
Dates covered: 00-00-2000 to 00-00-2000
Title: Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces
Performing organization: University of Illinois at Urbana-Champaign, Department of Computer Science, 201 N. Goodwin Avenue, Urbana, IL, 61802-2302
Distribution/availability statement: Approved for public release; distribution unlimited
Security classification (report, abstract, this page): unclassified
Number of pages: 26


Figure 1: Global and Local Dimensionality Reduction Techniques. (a) GDR (from 2-d to 1-d) on globally correlated
data. (b) GDR (from 2-d to 1-d) on globally non-correlated (but locally correlated) data. (c) LDR (from 2-d
to 1-d) on the same data as in (b).

trees [5, 44, 28, 11, 31, 7]. Although these index structures work well in low to medium dimensionality spaces
(up to 20-30 dimensions), a simple sequential scan usually performs better at higher dimensionalities [6, 43].

To scale to higher dimensionalities, a commonly used approach is dimensionality reduction [20]. This technique
has been proposed for both multimedia retrieval [17, 36, 27, 42] and data mining [18, 4, 21] applications.
The idea is to first reduce the dimensionality of the data and then index the reduced space using a multidimensional
index structure [17]. Most of the information in the dataset is condensed to a few dimensions (the first
few principal components (PCs)) by using principal component analysis (PCA). The PCs can be arbitrarily oriented
with respect to the original axes (see Appendix A for details on PCA). The remaining dimensions (i.e. the
later components) are eliminated and the index is built on the reduced space. To answer a query, the query is
first mapped to the reduced space and then executed on the index structure. Since the distance in the
reduced-dimensional space lower bounds the distance in the original space, the query processing algorithm can guarantee
no false dismissals [17, 16]. The answer set returned by the index can contain false positives (i.e. false admissions),
which are eliminated before the result is returned to the user. We refer to this technique as global dimensionality
reduction (GDR), i.e. dimensionality reduction over the entire dataset taken together.
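
The filter-and-refine idea behind GDR can be illustrated with a minimal sketch. It assumes the L2 metric, uses NumPy, and replaces the multidimensional index with a linear scan over the reduced vectors; the function names (`fit_pca`, `reduce_points`, `range_query`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_pca(data):
    """Return the data mean and the principal components as rows,
    ordered by decreasing eigenvalue of the covariance matrix."""
    mean = data.mean(axis=0)
    cov = np.cov(data - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # largest variance first
    return mean, eigvecs[:, order].T

def reduce_points(points, mean, components, d):
    """Map points to their coordinates along the first d principal components."""
    return (points - mean) @ components[:d].T

def range_query(data, reduced, mean, components, d, query, radius):
    """Range query with no false dismissals under the L2 metric.

    Filter step: a linear scan over the reduced vectors stands in for the
    index (an assumption made to keep the sketch short). Because the
    reduced-space distance lower bounds the original distance, every true
    answer survives the filter.
    Refine step: false positives are removed using the original vectors.
    """
    q_red = reduce_points(query, mean, components, d)
    candidates = np.nonzero(np.linalg.norm(reduced - q_red, axis=1) <= radius)[0]
    exact = np.linalg.norm(data[candidates] - query, axis=1)
    answers = candidates[exact <= radius]
    false_positives = len(candidates) - len(answers)
    return answers, false_positives
```

The number of false positives returned by the filter step is the quantity that grows when the reduction loses distance information, which is what drives up the query cost.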

GDR  works  well  when  the  dataset  is  globally  correlated  i.e.  most  of  the  variation  in  the  data  can  be  captured 
by  a  few  orthonormal  dimensions  (the  first  few  PCs).  Such  a  case  is  illustrated  in  Figure  1(a)  where  a  single 
dimension  (the  first  PC)  captures  the  variation  of  data  in  the  2-d  space.  In  such  cases,  it  is  possible  to  eliminate 
most  of  the  dimensions  (the  later  PCs)  with  little  or  no  loss  of  distance  information.  However,  in  practice,  the 
dataset  may  not  be  globally  correlated  (see  Figure  1(b)).  In  such  cases,  reducing  the  data  dimensionality  using 
GDR  will  cause  a  significant  loss  of  distance  information.  Loss  in  distance  information  is  manifested  by  a  large 
number of false positives and is measured by precision [27] (cf. Section 5). The greater the loss, the larger the number
of false positives and the lower the precision. False positives increase the cost of the query by (1) causing the query
to make unnecessary accesses to nodes of the index structure and (2) adding to the post-processing cost of the
query, that of checking the objects returned by the index and eliminating the false positives. The cost increases
with the number of false positives. Note that false positives do not affect the quality of the answers,
as they are not returned to the user.

Even  when  a  global  correlation  does  not  exist,  there  may  exist  subsets  of  data  that  are  locally  correlated  (e.g., 
the  data  in  Figure  1(b)  is  not  globally  correlated  but  is  locally  correlated  as  shown  in  Figure  1(c)).  Obviously,  the 
correlation structure (the PCs) differs from one subset to another, as otherwise the data would be globally correlated.
We refer to these subsets as correlated clusters or simply clusters.¹ In such cases, GDR would not be able to

¹ Note that correlated clusters (formally defined in Section 3) differ from the usual definition of clusters, i.e. a set of spatially close
obtain a single reduced space of desired dimensionality for the entire dataset without significant loss of query
accuracy. If we perform dimensionality reduction on each cluster individually (assuming we can find the clusters)
rather than on the entire dataset, we can obtain a set of different reduced spaces of desired dimensionality (as
shown in Figure 1(c)) which together cover the entire dataset² and do so with minimal loss of query precision
and hence significantly lower query cost. We refer to this approach as local dimensionality reduction (LDR).
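
To make the contrast between GDR and LDR concrete, the following sketch builds a synthetic 2-d dataset in the spirit of Figure 1(b): two elongated clusters with different orientations. It then compares the mean squared error of a single global 1-d reduction against per-cluster 1-d reductions. The data generator, its parameters and the helper names are assumptions made for illustration only; the cluster memberships are taken as known, which the paper's algorithm has to discover.

```python
import numpy as np

def residual_mse(points, d):
    """Mean squared distance between the points and their projection onto
    their own first d principal components (sum of the smallest D - d
    eigenvalues of the covariance matrix)."""
    centered = points - points.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False, bias=True))
    return float(np.sum(eigvals[: points.shape[1] - d]))  # ascending order

def make_line_cluster(rng, n, center, direction, spread, noise):
    """Hypothetical generator: n points scattered along `direction` through
    `center`, with small noise perpendicular to that direction."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    normal = np.array([-direction[1], direction[0]])
    along = rng.normal(scale=spread, size=(n, 1))
    across = rng.normal(scale=noise, size=(n, 1))
    return np.asarray(center, dtype=float) + along * direction + across * normal

rng = np.random.default_rng(0)
cluster_a = make_line_cluster(rng, 500, (0.0, 0.0), (1.0, 1.0), 5.0, 0.1)
cluster_b = make_line_cluster(rng, 500, (20.0, 0.0), (1.0, -1.0), 5.0, 0.1)
both = np.vstack([cluster_a, cluster_b])

# One reduced space for everything (GDR) versus one per cluster (LDR).
global_mse = residual_mse(both, 1)
local_mse = (len(cluster_a) * residual_mse(cluster_a, 1)
             + len(cluster_b) * residual_mse(cluster_b, 1)) / len(both)
print(f"global 1-d MSE: {global_mse:.4f}, local 1-d MSE: {local_mse:.4f}")
```

On data of this shape, no single direction captures both clusters, so the global error is dominated by the spread of one cluster across the chosen axis, while each local reduction only loses the perpendicular noise.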

Contributions: In this paper, we propose LDR as an approach to high dimensional indexing. Our contributions
can be summarized as follows:

• We develop an algorithm to discover correlated clusters in the dataset. Like any clustering problem, the
problem, in general, is NP-Hard [32]. Hence, our algorithm is heuristic-based. Our algorithm performs
dimensionality reduction of each cluster individually to obtain the reduced space (referred to as subspace)
for each cluster. The data items that do not belong to any cluster are output as outliers. The algorithm
allows the user to control the amount of information loss incurred by dimensionality reduction and hence
the query precision/cost.

• We present a technique to index the subspaces individually. We present query processing algorithms for
point, range and k-nearest neighbor (k-NN) queries that execute on the index structure. Unlike many
previous techniques [27, 42], our algorithms guarantee correctness of the result, i.e. they return exactly the
same answers as if the query were executed on the original space. In other words, the answer set returned to the
user has no false positives or false negatives.

• We perform extensive experiments on synthetic as well as real-life datasets to evaluate the effectiveness
of LDR as an indexing technique and compare it with other techniques, namely GDR, an index structure
on the original HDFS (referred to as the original space indexing (OSI) technique) and linear scan. Our
experiments show that (1) LDR can reduce dimensionality with significantly lower loss in query precision
than the GDR technique: for the same reduced dimensionality, LDR outperforms GDR by almost an
order of magnitude in terms of precision, and (2) LDR performs significantly better than the other techniques,
namely GDR, original space indexing and sequential scan, in terms of query cost for both synthetic and
real-life datasets.

Roadmap:  The  rest  of  the  paper  is  organized  as  follows.  In  Section  2,  we  provide  an  overview  of  related 
work.  In  Section  3,  we  present  the  algorithm  to  discover  the  correlated  clusters  in  the  data.  Section  4  discusses 
techniques  to  index  the  subspaces  and  support  similarity  queries  on  top  of  the  index  structure.  In  Section  5,  we 
present  the  performance  results.  Section  6  offers  the  final  concluding  remarks. 

2  Related  Work 

In  this  section,  we  discuss  the  related  work  on  high  dimensional  index  structures,  global  dimensionality  reduction 
and  clustering  algorithms. 

High Dimensional Index Structures: Recent research on high dimensional indexing has led to the development
of several index structures including X-tree [5], SS-tree [44], SR-tree [28], M-tree [11], TV-tree [31] and
Hybrid-tree [7]. These index structures use novel data/space partitioning strategies and scale better to high
dimensionalities compared to spatial index structures (e.g., R-tree, grid file). They are extensively used for similarity
search in multimedia retrieval [17, 10], data mining [14, 3] and decision support [40, 13] applications. Although

points.  To  avoid  confusion,  we  refer  to  the  latter  as  spatial  clusters  in  this  paper. 

² The set of reduced spaces may not necessarily cover the entire dataset as there may be outliers. We account for outliers in our
algorithm. 
these index structures can scale to medium dimensionalities (up to 20-30 dimensions), above a certain dimensionality
(referred to as the critical dimensionality), they are outperformed by a simple sequential scan through the
database [43, 6]. The reason is that the data space becomes sparse at high dimensionalities, causing the bounding
regions to become large. The query ends up overlapping with most nodes of the tree, resulting in a large number
of disk accesses and hence a high query cost. The linear scan performs better in such cases since sequential I/O
is significantly cheaper than random I/O. Obviously, the critical dimensionality depends on the dataset
and the index structure used.

Global Dimensionality Reduction: GDR techniques have been studied extensively in statistical pattern recognition
and multivariate data analysis. The principal component analysis (PCA) or Karhunen-Loeve (K-L) transform
is the optimal way of mapping points in a D-dimensional space to points in a d-dimensional space (d < D)
[12, 20]. The mapping is optimal in the sense that it minimizes the mean square error (MSE), where the error is
the distance between each D-d point and its d-d image. Subsequently, the d-d space is indexed using a
multidimensional index structure and queries are answered using the reduced dimensional index (see [17, 27] for
details).
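
The optimality statement can be written out as a short worked formula. This is a standard property of PCA rather than something derived in this paper, and the symbols below ($X$ for a random data point, $\mu$ for the data mean, $\lambda_i$ and $\phi_i$ for the eigenvalues and orthonormal eigenvectors of the covariance matrix, sorted so that $\lambda_1 \ge \dots \ge \lambda_D$) are notation assumed here for illustration:

```latex
\mathrm{MSE}(d)
  = E\Big\{ \Big\| (X - \mu) - \sum_{i=1}^{d} \big( (X - \mu) \cdot \phi_i \big)\, \phi_i \Big\|_2^2 \Big\}
  = \sum_{i=d+1}^{D} \lambda_i
```

In words, the error incurred by keeping the first d components is the total variance along the eliminated components, so the reduction is cheap exactly when the trailing eigenvalues are small.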

Clustering: Clustering algorithms have been studied recently in the data mining domain (e.g., BIRCH, CLARANS,
DBSCAN and CURE algorithms) [45, 35, 24, 14, 29]. The algorithms most related to this paper are those that
discover patterns in low dimensional subspaces [1, 2]. In [1], Agrawal et al. present an algorithm, called CLIQUE,
to discover "dense" regions in all subspaces of the original data space. The algorithm works from lower to higher
dimensionality subspaces: it starts by discovering 1-d dense units and iteratively discovers all dense units in
each k-d subspace by building from the dense units in (k-1)-d subspaces. In [2], Aggarwal et al. present an
algorithm, called PROCLUS, that clusters the data based on their correlation, i.e. partitions the data into disjoint
groups of correlated points. The authors use the hill climbing technique, popular in spatial cluster analysis, to
determine the projected clusters. Neither CLIQUE nor PROCLUS can be used as an LDR technique since they
cannot discover clusters when the principal components are arbitrarily oriented. They can discover only those
clusters that are correlated along one or more of the original dimensions. The above techniques are meant for
discovering interesting patterns in the data; since correlation along arbitrarily oriented components is usually not
that interesting to the user, they do not attempt to discover such correlation. In contrast, the goal of LDR is
efficient indexing; it must be able to discover such correlation in order to minimize the loss of information and
make indexing efficient. Also, since the motivation of their work is pattern discovery and not indexing, they do
not address the indexing and query processing issues which we have addressed in this paper. To the best of our
knowledge, this is the first paper that proposes to exploit the local correlations in data for the purpose of indexing.

3  Identifying  Correlated  Clusters 

In  this  section,  we  formally  define  the  notion  of  correlated  clusters  and  present  an  algorithm  to  discover  such 
clusters  in  the  data. 

3.1  Definitions 

In  developing  the  algorithm  to  identify  the  correlated  clusters,  we  will  need  the  following  definitions. 

Definition 1 (Cluster and Subspace) Given a set $\mathcal{A}$ of $N$ points in a $D$-dimensional feature space, we define a
cluster $S$ as a set $\mathcal{A}_S$ ($\mathcal{A}_S \subset \mathcal{A}$) of locally correlated points. Each cluster $S$ is defined by $S = \langle \Phi_S, d_S, C_S, \mathcal{A}_S \rangle$
where:
Symbol             Definition
$N$                Number of objects in the database
$M$                Maximum number of clusters desired
$K$                Actual number of clusters found ($K \le M$)
$D$                Dimensionality of the original feature space
$S_i$              The $i$th cluster
$C_i$              Centroid of $S_i$
$n_i$              Size of $S_i$ (number of objects)
$\mathcal{A}_i$    Set of points in $S_i$
$\Phi_i$           The principal components of $S_i$
$\Phi_i^{(j)}$     The $j$th principal component of $S_i$
$d_i$              Subspace dimensionality of $S_i$
$\epsilon$         Neighborhood range
MaxReconDist       Maximum reconstruction distance
FracOutliers       Permissible fraction of outliers
MinSize            Minimum size of a cluster
MaxDim             Maximum subspace dimensionality of a cluster
$O$                Set of outliers

Table 1: Summary of symbols and definitions


• $\Phi_S$ are the principal components of the cluster, $\Phi_S^{(i)}$ denoting the $i$th principal component.

• $d_S$ is the reduced dimensionality i.e. the number of dimensions retained. Obviously, the retained dimensions
correspond to the first $d_S$ principal components $\Phi_S^{(i)}$, $1 \le i \le d_S$, while the eliminated dimensions
correspond to the next $(D - d_S)$ components. Hence we use the terms (principal) components and dimensions
interchangeably in the context of the transformed space.

• $C_S = [C_S^{(d_S+1)} \cdots C_S^{(D)}]$ is the centroid, which stores, for each eliminated dimension $\Phi_S^{(i)}$, $(d_S + 1) \le i \le D$,
a single constant that is "representative" of the position of every point in the cluster along this
unrepresented dimension (as we are not storing their unique positions along these dimensions).

• $\mathcal{A}_S$ is the set of points in the cluster.

The reduced dimensionality space defined by $\Phi_S^{(i)}$, $1 \le i \le d_S$, is called the subspace of $S$. $d_S$ is called the
subspace dimensionality of $S$.


Definition 2 (Reconstruction Vector) Given a cluster $S = \langle \Phi_S, d_S, C_S, \mathcal{A}_S \rangle$, we define the reconstruction
vector $ReconVect(Q, S)$ of a point $Q$ from $S$ as follows:

$$ReconVect(Q, S) = \sum_{i=d_S+1}^{D} (Q \cdot \Phi_S^{(i)} - C_S^{(i)})\, \Phi_S^{(i)} \qquad (1)$$

where $\sum$ denotes vector addition and $\cdot$ denotes scalar product (i.e. $Q \cdot \Phi_S^{(i)}$ is the projection of $Q$ on $\Phi_S^{(i)}$ as shown
in Figure 2). $(Q \cdot \Phi_S^{(i)} - C_S^{(i)})$ is the (scalar) distance of $Q$ from the centroid along each eliminated dimension
and $ReconVect(Q, S)$ is the vector of these distances.


Definition 3 (Reconstruction Distance) Given a cluster $S = \langle \Phi_S, d_S, C_S, \mathcal{A}_S \rangle$, we now define the
reconstruction distance (scalar) $ReconDist(Q, S, \mathcal{D})$ of a point $Q$ from $S$. $\mathcal{D}$ is the distance function used to
define the similarity between points in the HDFS. Let $\mathcal{D}$ be an $L_p$ metric i.e.
$\mathcal{D}(P, P') = \|P - P'\|_p = [\sum_{i=1}^{D} (|P[i] - P'[i]|)^p]^{1/p}$. We define $ReconDist(Q, S, \mathcal{D})$³ as follows:

$$ReconDist(Q, S, \mathcal{D}) = ReconDist(Q, S, L_p) = \|ReconVect(Q, S)\|_p = \Big[\sum_{i=d_S+1}^{D} (|Q \cdot \Phi_S^{(i)} - C_S^{(i)}|)^p\Big]^{1/p} \qquad (2)$$

Figure 2: Centroid and Reconstruction Distance.


For any point $Q$ mapped to the $d_S$-dimensional subspace of $S$, $ReconVect(Q, S)$ represents the error in
the representation, i.e. the vector difference between the exact $D$-dimensional representation of $Q$ and its
approximate representation in the $d_S$-dimensional subspace of $S$. The higher the error, the greater the amount of distance
information lost. When averaged over all points in $S$, we get the average information loss in $S$, which is called
the reconstruction error of $S$.


Definition 4 (Reconstruction Error) The reconstruction error $\varepsilon^2(S)$ of cluster $S$ is defined as the mean square
magnitude of $ReconVect(Q, S)$ where $Q \in \mathcal{A}_S$:

$$\varepsilon^2(S) = E\{(\|ReconVect(Q, S)\|_2)^2\} = \sum_{i=d_S+1}^{D} E\{(C_S^{(i)} - Q \cdot \Phi_S^{(i)})^2\} \qquad (3)$$

where $E\{X\}$ denotes the expected value of $X$.
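
Definitions 2 to 4 can be sketched in a few lines of NumPy. The `Cluster` class and the function names below are hypothetical; the principal components are obtained with an eigendecomposition of the sample covariance matrix, and the centroid is taken to be the mean of the points projected on the eliminated components, which is the choice the paper makes in Section 3.2.

```python
import numpy as np

class Cluster:
    """Hypothetical container for a cluster S = <Phi_S, d_S, C_S, A_S>."""

    def __init__(self, points, d):
        self.points = np.asarray(points, dtype=float)      # A_S
        self.d = d                                          # d_S
        mean = self.points.mean(axis=0)
        cov = np.cov(self.points - mean, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)              # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]
        self.components = eigvecs[:, order].T               # row i-1 holds Phi_S^(i)
        # C_S: one constant per eliminated component, E{P} . Phi_S^(i)
        self.centroid = self.components[d:] @ mean

def recon_vect(q, cluster):
    """Equation (1): sum over eliminated components of
    (Q . Phi^(i) - C^(i)) Phi^(i)."""
    eliminated = cluster.components[cluster.d:]
    coeffs = eliminated @ np.asarray(q, dtype=float) - cluster.centroid
    return coeffs @ eliminated

def recon_dist(q, cluster, p=2):
    """Equation (2): L_p norm of the per-component offsets from the centroid."""
    eliminated = cluster.components[cluster.d:]
    offsets = np.abs(eliminated @ np.asarray(q, dtype=float) - cluster.centroid)
    return float(np.sum(offsets ** p) ** (1.0 / p))

def recon_error(cluster):
    """Equation (3): mean squared L2 reconstruction distance over A_S."""
    return float(np.mean([recon_dist(q, cluster, 2) ** 2 for q in cluster.points]))
```

For the L2 metric the two definitions agree, since the eliminated components are orthonormal: the Euclidean length of `recon_vect(q, cluster)` equals `recon_dist(q, cluster)`.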


3.2  Constraints  on  Correlated  Clusters 

Our  objective  in  defining  clusters  is  to  identify  low  dimensional  subspaces,  one  for  each  cluster,  that  can  be 
indexed  separately.  We  desire  each  subspace  to  have  as  low  dimensionality  as  possible  without  losing  too  much 
distance  information.  In  order  to  achieve  the  desired  goal,  each  cluster  must  satisfy  the  following  constraints: 

1. Reconstruction Distance Bound: In order to restrict the maximum representation error of any point in
the low dimensional subspace, we enforce the reconstruction distance of any point $P \in \mathcal{A}_S$ to satisfy
the following condition: $ReconDist(P, S) \le MaxReconDist$, where MaxReconDist is a parameter
specified by the user. This condition restricts the amount of information lost within each cluster and hence
guarantees a high precision, which in turn implies lower query cost.

³ Assuming that $\mathcal{D}$ is a fixed $L_p$ metric, we usually omit the $\mathcal{D}$ in $ReconDist(Q, S, \mathcal{D})$ for simplicity of notation.
2. Dimensionality Bound: For efficient indexing, we want the subspace dimensionality to be as low as
possible while still maintaining high query precision. A cluster must not retain any more dimensions than
necessary. In other words, it must retain the minimum number of dimensions required to accommodate
the points in the dataset. Note that a cluster $S$ can accommodate a point $P$ only if $ReconDist(P, S) \le$
MaxReconDist. To ensure that the subspace dimensionality $d_S$ is below the critical dimensionality of
the multidimensional index structure (i.e. the dimensionality above which a sequential scan is better), we
enforce the following condition: $d_S \le MaxDim$, where MaxDim is specified by the user.

3. Choice of Centroid: For each cluster $S$, we use PCA to determine the subspace i.e. $\Phi_S$ is the set of
eigenvectors of the covariance matrix of $\mathcal{A}_S$ sorted based on their eigenvalues. [20] shows that for a given
choice of reduced dimensionality $d_S$, the reconstruction error $\varepsilon^2(S)$ is minimized by choosing the first $d_S$
components among $\Phi_S$ and choosing $C_S$ to be the mean value of the points (i.e. the centroid) projected
on the eliminated dimensions. To minimize the information loss, we choose $C_S^{(i)} = E\{P \cdot \Phi_S^{(i)}\} =
E\{P\} \cdot \Phi_S^{(i)}$ (see Figure 2).

4. Size Bound: Finally, we desire each cluster to have a minimum cardinality (number of points): $n_S \ge$
MinSize, where MinSize is user-specified. The clusters that are too small are considered to be outliers.

The goal of the LDR algorithm described below is to discover the set $\mathcal{S} = \{S_1, S_2, \ldots, S_K\}$ of $K$ clusters (where
$K \le M$, $M$ being the maximum number of clusters desired) that exist in the data and satisfy the above
constraints. The remaining points, which do not belong to any of the clusters, are placed in the outlier set $O$.
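
The interplay between the reconstruction distance bound and the dimensionality bound can be illustrated with a small sketch that computes, for one point and one cluster, the smallest number of leading principal components that must be retained so that the point satisfies the reconstruction distance bound. The sketch assumes the L2 metric and a centroid equal to the projected cluster mean; the function name `least_dimensionality` and its argument layout are hypothetical.

```python
import numpy as np

def least_dimensionality(q, mean, components, max_recon_dist):
    """Smallest d such that, with only the first d principal components
    retained, the L2 reconstruction distance of q is <= max_recon_dist.

    components: array of shape (D, D), row i holding the (i+1)-th PC.
    mean: cluster mean, so that (q - mean) . PC is the offset of q from
          the centroid along that component.
    """
    offsets = components @ (np.asarray(q, dtype=float) - mean)
    squared = offsets ** 2
    # tail[d] = squared reconstruction distance when the first d PCs are kept
    #         = sum of squared offsets along components d+1 .. D
    tail = np.concatenate([np.cumsum(squared[::-1])[::-1], [0.0]])
    feasible = np.nonzero(tail <= max_recon_dist ** 2)[0]
    return int(feasible[0])   # tail[D] == 0, so a feasible d always exists

# Usage: with the identity as PCs, the offsets are just the coordinates.
components = np.eye(3)
mean = np.zeros(3)
ld = least_dimensionality([3.0, 0.5, 0.1], mean, components, max_recon_dist=1.0)
max_dim = 2   # hypothetical user-specified MaxDim
can_be_accommodated = ld <= max_dim
```

A point whose required dimensionality exceeds MaxDim for every cluster cannot be accommodated under the two bounds and would end up in the outlier set.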

3.3  The  Clustering  Algorithm 

Since  the  LDR  algorithm  needs  to  perform  local  correlation  analysis  (i.e.  PCA  on  subsets  of  points  in  the  dataset 
rather  than  the  whole  dataset),  we  need  to  first  identify  the  right  subsets  to  perform  the  analysis  on.  This  poses  a 
cyclic  problem:  how  do  we  identify  the  right  subsets  without  doing  the  correlation  analysis  and  how  do  we  do  the 
analysis  without  knowing  the  subsets.  We  break  the  cycle  by  using  spatial  clusters  as  an  initial  guess  of  the  right 
subsets. Then we perform PCA on each spatial cluster individually. Finally, we 'recluster' the points based on
the correlation information (i.e. principal components) to obtain the correlated clusters. The clustering algorithm
is shown in Table 2. It takes a set of points $\mathcal{A}$ and a set of clusters $\mathcal{S}$ as input. When it is invoked for the first
time, $\mathcal{A}$ is the entire dataset and each cluster in $\mathcal{S}$ is marked 'empty'. At the end, each identified cluster is marked
'complete', indicating a completely constructed cluster (no further change); the remaining clusters remain marked
'empty'. The points that do not belong to any of the clusters are placed in the outlier set $O$. The details of each
step are described below:

• Construct Spatial Clusters (Steps FC1 and FC2): The algorithm starts by constructing $M$ spatial clusters,
where $M$ is the maximum number of clusters desired. We use a simple single-pass partitioning-based
spatial clustering algorithm to determine the spatial clusters [29, 35]. We first choose a set $C \subset \mathcal{A}$ of
well-scattered points as the centroids such that points that belong to the same spatial cluster are not chosen
to serve as centroids of different clusters. Such a set $C$ is called a piercing set [2]. We achieve this by
ensuring that each point $P \in C$ is sufficiently far from any already chosen point $P' \in C$ i.e.
$Dist(P, P') > threshold$ for a user-defined threshold.⁴ This technique, proposed by Gonzalez [22], is
guaranteed to return a piercing set if no outliers are present. To avoid scanning through the whole database

⁴ For subsequent invocations of the FindClusters procedure during the iterative algorithm (Step 2 in Table 3), there may exist already
completed clusters (none exist during the initial invocation). Hence $P$ must also be sufficiently far from all complete clusters formed
so far i.e. $ReconDist(P, S) > threshold$ for each complete cluster $S$.
Clustering Algorithm

Input: Set of points $\mathcal{A}$, set of clusters $\mathcal{S}$ (each cluster is either empty or complete)

Output: Some empty clusters are completed; the remaining points form the set of outliers $O$

FindClusters($\mathcal{A}$, $\mathcal{S}$, $O$)

FC1: For each empty cluster, select a random point $P \in \mathcal{A}$ such that $P$ is sufficiently far from all completed
and valid clusters. If found, make $P$ the centroid $C_i$ and mark $S_i$ valid.

FC2: For each point $P \in \mathcal{A}$, add $P$ to the closest valid cluster $S_i$ (i.e. $i = \arg\min_j Distance(P, C_j)$) if $P$
lies in the $\epsilon$-neighborhood of $C_i$ i.e. $Distance(P, C_i) \le \epsilon$.

FC3: For each valid cluster $S_i$, compute the principal components $\Phi_i$ using PCA. Remove all points from $\mathcal{A}_i$.

FC4: For each point $P \in \mathcal{A}$, find the valid cluster $S_i$ that, among all the valid clusters, requires the minimum
subspace dimensionality $LD(P)$ to satisfy $ReconDist(P, S_i) \le MaxReconDist$ (break ties
arbitrarily). If $LD(P) \le MaxDim$, increment $V_i[j]$ for $j = 0$ to $(LD(P) - 1)$ and $n_i$ by 1.

FC5: For each valid cluster $S_i$, compute the subspace dimensionality $d_i$ as: $d_i = \{j \mid F_i[j] \le
FracOutliers \text{ and } F_i[j - 1] > FracOutliers\}$ where $F_i[j] = V_i[j] / n_i$.

FC6: For each point $P \in \mathcal{A}$, add $P$ to the first valid cluster $S_i$ such that $ReconDist(P, S_i) \le
MaxReconDist$. If no such $S_i$ exists, add $P$ to $O$.

FC7: If a valid cluster $S_i$ violates the size constraint i.e. ($|\mathcal{A}_i| < MinSize$), mark it empty.
Remove each point $P \in \mathcal{A}_i$ from $S_i$ and add it to the first succeeding cluster $S_j$ that satisfies
$ReconDist(P, S_j) \le MaxReconDist$ or to $O$ if there is no such cluster. Mark the other valid clusters
complete. For each complete cluster $S_i$, map each point $P \in \mathcal{A}_i$ to the subspace and store it along with
$ReconDist(P, S_i, \mathcal{D})$.

Table  2:  Clustering  Algorithm 
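
Steps FC1 and FC2 of Table 2 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's implementation: it uses the L2 distance, draws the centroids from an in-memory sample, ignores already completed clusters, and assigns points in one batch without updating the centroids as points are added. The function names are hypothetical.

```python
import numpy as np

def choose_centroids(sample, m, threshold, rng):
    """FC1 (simplified): pick up to m points from the sample such that every
    chosen point is more than `threshold` away from all previously chosen
    points."""
    centroids = []
    for idx in rng.permutation(len(sample)):
        candidate = sample[idx]
        if all(np.linalg.norm(candidate - c) > threshold for c in centroids):
            centroids.append(candidate)
            if len(centroids) == m:
                break
    return np.array(centroids)

def assign_to_centroids(points, centroids, eps):
    """FC2 (simplified): label each point with the index of its closest
    centroid if it lies within the eps-neighborhood of that centroid;
    points outside every neighborhood are ignored (label -1)."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    closest = dists.argmin(axis=1)
    within = dists[np.arange(len(points)), closest] <= eps
    return np.where(within, closest, -1)
```

The resulting groups only serve as an initial guess: PCA is then run on each group separately (Step FC3), and the points are reassigned based on reconstruction distance in the later steps.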


to choose the centroids, we first construct a random sample of the dataset and choose the centroids from
the sample [2, 19, 24]. We choose the sample to be large enough (using Chernoff bounds [34]) such that
the probability of missing clusters due to sampling is low, i.e. there is at least one point from each cluster
present in the sample with a high probability [24]. Once the centroids are chosen, we group each point
$P \in \mathcal{A}$ with the closest centroid $C_{closest}$ if $Distance(P, C_{closest}) \le \epsilon$ and update the centroid to reflect the
mean position of its group. If $Distance(P, C_{closest}) > \epsilon$, we ignore $P$. The restriction of the neighborhood
range to $\epsilon$ makes the correlation analysis localized. The smaller the value of $\epsilon$, the more localized the analysis.
At the same time, $\epsilon$ has to be large enough so that we get a sufficiently large number of points in the cluster,
which is necessary for the correlation analysis to be robust.

• Compute PCs (Step FC3): Once we have the spatial clusters, we perform PCA on each spatial cluster $S_i$
individually to obtain the principal components $\Phi_i^{(j)}$, $j = [1, D]$ (see Appendix A for details on PCA). We
do not eliminate any components yet. We compute the mean value of the points in $S_i$ so that we can
compute $ReconDist(P, S_i)$ in Steps FC4 and FC5 for any choice of subspace dimensionality $d_i$. Finally,
we remove the points from the spatial clusters so that they can be reclustered as described in Step FC6.

• Determine Subspace Dimensionality (Steps FC4 and FC5): For each cluster $S_i$, we must retain no more
dimensions than necessary to accommodate the points in the dataset (except the outliers). To determine the
number of dimensions $d_i$ to be retained for each cluster $S_i$, we first determine, for each point $P \in A$, the
best cluster, if one exists, for placing $P$. Let $LD(P, S_i)$ denote the least dimensionality needed for the
cluster $S_i$ to represent $P$ with $ReconDist(P, S_i) \le MaxReconDist$. Formally,

$$LD(P, S_i) = \{\, d \mid ReconDist(P, S_i) \le MaxReconDist \ \text{if}\ d_i \ge d \qquad (4)$$

$$\text{and}\ ReconDist(P, S_i) > MaxReconDist \ \text{otherwise} \,\} \qquad (5)$$

In other words, the first $LD(P, S_i)$ PCs are just enough to satisfy the above constraint. Note that such
an $LD(P, S_i)$ always exists for a non-negative $MaxReconDist$. Let $LD(P) = \min \{ LD(P, S_i) \mid S_i$
is a valid cluster $\}$. If $LD(P) \le MaxDim$, there exists a cluster that can accommodate $P$ without
violating the dimensionality bound. Let $LD(P, S_i) = LD(P)$ (if there are multiple such clusters $S_i$, break
ties arbitrarily). We say $S_i$ is the “best” cluster for placing $P$ since $S_i$ is the cluster that, among all the
valid clusters, needs to retain the minimum number of dimensions to accommodate $P$. $P$ would satisfy
the $ReconDist(P, S_i) \le MaxReconDist$ bound if the subspace dimensionality $d_i$ of $S_i$ is such that
$LD(P, S_i) \le d_i \le MaxDim$ and would violate it if $0 \le d_i < LD(P, S_i)$. For each cluster $S_i$, we
maintain this information as a count array $V_i[j]$, $j = [0, MaxDim]$, where $V_i[j]$ is the number of points
that, among the points chosen to be placed in $S_i$, would violate the $ReconDist(P, S_i) \le MaxReconDist$
constraint if the subspace dimensionality $d_i$ is $j$. So in this case (for point $P$), we must increment $V_i[j]$ for
$j = 0$ to $(LD(P, S_i) - 1)$ and the total count $n_i$ of points chosen to be placed in $S_i$ ($V_i[j]$ and $n_i$ are
initialized to 0 before FC4 begins). On the other hand, if $LD(P) > MaxDim$, there exists no cluster in
which $P$ can be placed without violating the dimensionality bound, so we do nothing.

Figure 3: Determining subspace dimensionality (MaxDim=32). [Plot axis: fraction of points violating reconstruction distance.]

Figure 4: Splitting of correlated clusters due to initial spatial clustering.

At the end of the pass over the dataset, for each cluster $S_i$, we have computed $V_i[j]$, $j = [0, MaxDim]$,
and $n_i$. We use this to compute $F_i[j]$, $j = [0, MaxDim]$, where $F_i[j]$ is the fraction of points that, among
those chosen to be placed in $S_i$ (during FC4), would violate the $ReconDist(P, S_i) \le MaxReconDist$
constraint if the subspace dimensionality $d_i$ is $j$, i.e. $F_i[j] = V_i[j] / n_i$. An example of $F_i$ from one of the
experiments conducted on the real-life dataset (cf. Section 5.3) is shown in Figure 3. We choose $d_i$ to be
as low as possible without too many points violating the reconstruction distance bound, i.e. not more than
a $FracOutliers$ fraction of the points in $S_i$, where $FracOutliers$ is specified by the user. In other words, $d_i$
is the minimum number of dimensions that must be retained so that the fraction of points that violate the
$ReconDist(P, S_i) \le MaxReconDist$ constraint is no more than $FracOutliers$, i.e.
$d_i = \{ j \mid F_i[j] \le FracOutliers$ and $F_i[j-1] > FracOutliers \}$. In Figure 3, $d_i$ is 21 for $FracOutliers = 0.1$, 16 for
$FracOutliers = 0.2$ and 14 for $FracOutliers = 0.3$. We now have all the subspaces formed. In the
next step, we assign the points to the clusters.

• Recluster Points (Step FC6): In the reclustering step, we reassign each point $P \in A$ to a cluster $S$ that
covers $P$, i.e. $ReconDist(P, S) \le MaxReconDist$. If there exists no such cluster, $P$ is added to the
outlier set $O$. If there exists just one cluster that covers $P$, $P$ is assigned to that cluster. Now we consider
the interesting case of multiple clusters covering $P$. In this case, there is a possibility that some of these
clusters are actually parts of the same correlated cluster but have been split due to the initial spatial clustering.
This is illustrated in Figure 4. Since points in a correlated cluster can be spatially distant from each other
(e.g., form an elongated cluster in Figure 4) and spatial clustering only clusters spatially close points, it may
end up putting correlated points in different spatial clusters, thus breaking up a single correlated cluster
into two or more clusters. Although such ‘splitting’ does not affect the indexing cost of our technique for
range queries and k-NN queries, it increases the cost of point search and deletion, as multiple clusters may
need to be searched in contrast to just one when there is no ‘splitting’ (cf. Section 4.2.1). Hence, we must
detect these ‘broken’ clusters and merge them back together. We achieve this by maintaining the clusters
in some fixed order (e.g., the order in which they were created). For each point $P \in A$, we check each cluster
sequentially in that order and assign it to the first cluster that covers $P$. If two (or more) clusters are part
of the same correlated cluster, most points will be covered by all of them but will always be assigned to
only one of them, whichever appears first in the order. This effectively merges the clusters into one since only
the first one will remain while the others will end up being almost empty and will be discarded due to the
violation of the size bound in FC7. Note that the $FracOutliers$ bound in Step FC5 still holds, i.e. besides
the points for which $LD(P) > MaxDim$, no more than a $FracOutliers$ fraction of points can become
outliers.

• Map Points (Step FC7): In the final step of the algorithm, we eliminate clusters that violate the size
constraint. We remove each point from these clusters and add it to the first succeeding valid cluster $S_j$ that
satisfies the $ReconDist(P, S_j) \le MaxReconDist$ bound, or to $O$ otherwise. For the remaining clusters
$S_i$, we map each point $P \in A_i$ to the subspace by projecting $P$ onto $\Phi_i^{(j)}$, $1 \le j \le d_i$, and refer to it as the
($d_i$-d) image $Image(P, S_i)$ of $P$:

$$Image(P, S_i)[j] = P \cdot \Phi_i^{(j)} \quad \text{for } 1 \le j \le d_i \qquad (6)$$

We refer to $P$ as the ($D$-d) original $Original(Image(P, S_i), S_i)$ of its image $Image(P, S_i)$. We store the
image of each point along with the reconstruction distance $ReconDist(P, S_i)$.
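
Steps FC4 and FC5 can be illustrated for a single cluster with a minimal sketch. It assumes the values $LD(P, S_i)$ have already been computed for the points that chose $S_i$ as their best cluster; the function name `choose_subspace_dim` and its arguments are illustrative and do not come from the paper.

```python
def choose_subspace_dim(ld_values, max_dim, frac_outliers):
    """Return d_i for one cluster, or None if no point chose the cluster.

    ld_values: LD(P, S_i) for each point P whose best cluster is S_i
               (each assumed to lie in [0, max_dim]).
    """
    n = len(ld_values)
    if n == 0:
        return None
    # V[j] = number of points that would violate the ReconDist bound if
    # d_i = j, i.e. points with LD(P, S_i) > j (FC4 increments V[0..LD-1]).
    violations = [0] * (max_dim + 1)
    for ld in ld_values:
        for j in range(ld):
            violations[j] += 1
    # F[j] = V[j] / n_i is non-increasing in j, so the smallest j with
    # F[j] <= FracOutliers is the j selected in FC5.
    for j in range(max_dim + 1):
        if violations[j] / n <= frac_outliers:
            return j
    return None


# Hypothetical LD values for ten points.
ld_values = [2, 2, 3, 5, 8, 8, 9, 10, 12, 20]
d_loose = choose_subspace_dim(ld_values, max_dim=32, frac_outliers=0.25)
d_tight = choose_subspace_dim(ld_values, max_dim=32, frac_outliers=0.0)
```

A larger $FracOutliers$ tolerates more violating points and therefore yields a smaller $d_i$, which mirrors the trend reported for Figure 3.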

Since  FindClusters  chooses  the  initial  centroids  from  a  random  sample,  there  is  a  risk  of  missing  out  some 
clusters.  One  way  to  reduce  this  risk  is  to  choose  a  large  number  of  initial  centroids  but  at  the  cost  of  slowing 
down the clustering algorithm. We reduce the risk of missing clusters by trying to discover more clusters, if
any exist, among the points returned as outliers by the initial invocation of FindClusters. We iterate the above
process as long as new clusters are still being discovered, as shown below:

Iterative Clustering

(1) FindClusters($A$, $S$, $O$); /* initial invocation */

(2) Let $O'$ be an empty set. Invoke FindClusters($O$, $S$, $O'$). Make $O'$ the new outlier set, i.e.
$O \leftarrow O'$. If new clusters found, go to (2). Else return.

Table 3: Iterative Clustering Algorithm
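
The control flow of Table 3 can be sketched as a short loop. This is only an illustration: `find_clusters` is a caller-supplied stand-in for FindClusters that returns the clusters it found and the remaining outliers, and `toy_find_clusters` is a hypothetical example that has nothing to do with PCA.

```python
from collections import Counter


def iterative_clustering(points, find_clusters):
    """Re-run find_clusters on the outliers until no new cluster appears."""
    clusters, outliers = find_clusters(points)            # step (1)
    while outliers:
        new_clusters, outliers = find_clusters(outliers)  # step (2): O <- O'
        if not new_clusters:
            break
        clusters.extend(new_clusters)
    return clusters, outliers


def toy_find_clusters(points):
    """Toy stand-in: report only the most frequent value as a cluster."""
    value, count = Counter(points).most_common(1)[0]
    if count < 2:                       # plays the role of the MinSize bound
        return [], list(points)
    members = [p for p in points if p == value]
    rest = [p for p in points if p != value]
    return [members], rest


clusters, outliers = iterative_clustering([1, 1, 1, 2, 2, 3], toy_find_clusters)
```

Each pass receives only the previous outlier set, so later invocations work on much smaller inputs than the first one.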
The above iterative clustering algorithm is somewhat similar to the hill climbing technique commonly used
in spatial clustering algorithms (especially in partitioning-based clustering algorithms like k-means, k-medoids
and CLARANS [29, 35]). In this technique, the “bad quality” clusters (the ones that violate the size bound) are
discarded (Step FC7) and are replaced, if possible, by better quality clusters. However, unlike the hill climbing
approach where all the points are reassigned to the clusters, we do not reassign the points already assigned to
the ‘complete’ clusters. Alternatively, we can follow the hill climbing approach, but it is computationally more
expensive and requires more scans of the database [35].

Cost Analysis: We conclude this section with an analysis of the cost of the clustering algorithm. Let us first
analyze the cost of the first invocation of the FindClusters procedure (where $A$ is the whole dataset). The centroid
selection step (FC1) has a small cost since we are using a random sample and $|sample| \ll |A|$. Step FC2 requires
one pass through the dataset $A$ and has a time complexity of $O(NKD)$. Step FC3 has a complexity of $O(n_i D^2)$
for each cluster $S_i$ and hence an overall complexity of $O(ND^2)$ (since $\sum_i n_i \le N$). This step also has a memory
requirement of $O(n_i D)$ for each cluster and hence a maximum of $O(\max_i(n_i) D)$, which is smaller than the
memory requirement of $O(ND)$ of GDR. This is an advantage of LDR over GDR: while the latter requires the
whole dataset to fit in memory, the former requires only the points in the cluster to fit in memory. In either case,
if the memory is too small, we can perform SVD on a sample rather than the whole data [27]. Step FC4 requires
another pass through the database and has a time complexity of $O(ND^2K)$ (assuming $MaxDim$ is a constant).
Step FC5 is a simple step with a complexity of $O(KD)$. Step FC6 requires a final pass through the database and
has a time complexity of $O(ND^2K)$. Also, the first invocation of FindClusters accounts for most of the cost of
the algorithm since the later invocations have much smaller sets as input and hence much smaller cost. Thus, the
algorithm requires three passes through the dataset (FC2, FC4 and FC6) and has a time complexity of $O(ND^2K)$.

4  Indexing  Correlated  Clusters 

Having  developed  the  technique  to  find  the  correlated  clusters,  we  now  shift  our  attention  to  how  to  use  them  for 
indexing.  Our  objective  is  to  develop  a  data  structure  that  exploits  the  correlated  clusters  to  efficiently  support 
range  and  k-NN  queries  over  HDFSs.  The  developed  data  structure  must  also  be  able  to  handle  insertions  and 
deletions. 

4.1  Data  Structure 

The data structure, referred to as the global index structure (GI) (i.e. index on the entire dataset), consists of separate
multidimensional indices for each cluster, connected to a single root node. The global index structure is shown
in Figure 5. We explain the various components in detail below:

• The Root Node $R$ of GI contains the following information for each cluster $S_i$: (1) a pointer to the root
node $R_i$ (i.e. the address of the disk block containing $R_i$) of the cluster index $I_i$ (the multidimensional index
on $S_i$), (2) the principal components $\Phi_i$, (3) the subspace dimensionality $d_i$ and (4) the centroid $C_i$. It also
contains an access pointer to the outlier set $O$. If there is an index on $O$ (discussed later), this pointer points
to the root node of that index; otherwise, it points to the start of the set of blocks on which the outlier
set resides on disk. $R$ may occupy one or more disk blocks depending on the number of clusters $K$ and
the original dimensionality $D$.

• The Cluster Indices: We maintain a multidimensional index $I_i$ for each cluster $S_i$ in which we store the
reduced dimensional representation of the points in $S_i$. However, instead of building the index $I_i$ on the
$d_i$-d subspace of $S_i$ defined by $\Phi_i^{(j)}$, $1 \le j \le d_i$, we build $I_i$ on the $(d_i+1)$-d space, the first $d_i$ dimensions
of which are defined by $\Phi_i^{(j)}$, $1 \le j \le d_i$, as above while the $(d_i+1)$th dimension is defined by the
reconstruction distance $ReconDist(P, S_i, \mathcal{D})$. Including reconstruction distance as a dimension helps to
improve query precision (as explained later). We redefine the image $NewImage(P, S_i)$ of a point $P \in A_i$
as a $(d_i+1)$-d point (rather than a $d_i$-d point), incorporating the reconstruction distance as the $(d_i+1)$th
dimension:

$$NewImage(P, S_i)[j] = Image(P, S_i)[j] = P \cdot \Phi_i^{(j)} \quad \text{for } 1 \le j \le d_i \qquad (7)$$

$$NewImage(P, S_i)[j] = ReconDist(P, S_i, \mathcal{D}) \quad \text{for } j = d_i + 1 \qquad (8)$$

The $(d_i+1)$-d cluster index $I_i$ is constructed by inserting the $(d_i+1)$-d images (i.e. $NewImage(P, S_i)$)
of each point $P \in A_i$ into the multidimensional index structure using the insertion algorithm of the index
structure. Any disk-based multidimensional index structure (e.g., R-tree [25], X-tree [5], M-tree [11],
Hybrid Tree [7]) can be used for this purpose. We used the hybrid tree in our experiments since it is a
space partitioning index structure (i.e. has “dimensionality-independent” fanout), is more scalable to high
dimensionalities in terms of query cost and can support arbitrary distance metrics [7, 38, 9].

Figure 5: The global index structure (root containing pointers to the root of each cluster index)

•  The  Outlier  Index:  For  the  outlier  set  O,  we  may  or  may  not  build  an  index  depending  on  whether  the 
original  dimensionality  D  is  below  or  above  the  critical  dimensionality.  In  this  paper,  we  assume  that  D  is 
above  the  critical  dimensionality  of  the  index  structure  and  hence  choose  not  to  index  the  outlier  set  (i.e. 
use  sequential  scan  for  it). 
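
The $(d_i+1)$-d image of Equations (7) and (8) can be sketched as follows. This is a sketch under stated assumptions rather than the paper's implementation: the distance function is taken to be Euclidean, the principal components are assumed orthonormal and ordered by significance, and the reconstruction distance is assumed to be the norm of the part of (point - centroid) that lies along the eliminated components. The name `new_image` and its parameters are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def new_image(point, pcs, centroid, d):
    """Return the (d+1)-d image: d projected coordinates plus ReconDist."""
    # Eq. (7): coordinates along the d retained principal components.
    image = [dot(point, pc) for pc in pcs[:d]]
    # Eq. (8): reconstruction distance (assumed Euclidean, see lead-in).
    centered = [x - c for x, c in zip(point, centroid)]
    recon_dist = math.sqrt(sum(dot(centered, pc) ** 2 for pc in pcs[d:]))
    return image + [recon_dist]


# Toy example: the PCs are the coordinate axes of a 3-d space.
PCS = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
img = new_image((3.0, 4.0, 5.0), PCS, centroid=(1.0, 1.0, 1.0), d=1)
```

With all $D$ components retained the last coordinate is zero, which is why a suitable $LD(P, S_i)$ always exists for a non-negative $MaxReconDist$.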

Like other database index trees (e.g., B-tree, R-tree), the global index (GI) shown in Figure 5 is disk-based.
But it may not be perfectly height balanced, i.e. all paths from $R$ to leaf may not be of exactly equal length. The
reason  is  that  the  sizes  and  the  dimensionalities  may  differ  from  one  cluster  to  another  causing  the  cluster  indices 
to  have  different  heights.  We  found  that  GI  is  almost  height  balanced  (i.e.  the  difference  in  the  lengths  of  any  two 
paths  from  R  to  leaf  is  never  more  than  1  or  2)  due  to  the  size  bound  on  the  clusters  (see  Appendix  D  for  details). 
Also,  its  height  cannot  exceed  the  height  of  the  original  space  index  by  more  than  1  (see  Appendix  D  for  details). 

To guarantee the correctness of our query algorithms (i.e. to ensure no false dismissals), we need to show that
the cluster index distances lower bound the actual distances in the original $D$-d space [17, 16]. In other words,
for any two $D$-d points $P$ and $Q$, $\mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i))$ must always lower bound $\mathcal{D}(P, Q)$.

Lemma 1 (Lower Bounding Lemma) $\mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i))$ always lower bounds $\mathcal{D}(P, Q)$.
(Proof in Appendix B).

Note that instead of incorporating reconstruction distance as the $(d_i+1)$th dimension, we could have simply
constructed GI with each cluster index $I_i$ defined on the corresponding $d_i$-d subspace $\Phi_i^{(j)}$, $1 \le j \le d_i$. Since
the lower bounding lemma holds for the $d_i$-d subspaces (as shown in [17]), the query processing algorithms
described below would have been correct. The reason we use the $(d_i+1)$-d subspace is that the distances in the
$(d_i+1)$-d subspace upper bound the distances in the $d_i$-d subspace and hence provide a tighter lower bound to
distances in the original $D$-d space:

$$\mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i)) = \left[ \mathcal{D}(Image(P, S_i), Image(Q, S_i))^p + |ReconDist(P, S_i, \mathcal{D}) - ReconDist(Q, S_i, \mathcal{D})|^p \right]^{1/p}$$

$$\Rightarrow \mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i)) \ge \mathcal{D}(Image(P, S_i), Image(Q, S_i)) \qquad (9)$$

Furthermore, the difference between the two (i.e. $\mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i))$ and
$\mathcal{D}(Image(P, S_i), Image(Q, S_i))$) is usually significant when computing the distance of the query from a point
in the cluster. Say $P$ is a point in $S_i$ and $Q$ is the query point. Due to the reconstruction distance bound,
$ReconDist(P, S_i, \mathcal{D})$ is always a small number ($\le MaxReconDist$). On the other hand, $ReconDist(Q, S_i, \mathcal{D})$
can have any arbitrary value and is usually much larger than $ReconDist(P, S_i, \mathcal{D})$, thus making the difference
quite significant. This makes the distance computations in the $(d_i+1)$-d index more optimistic than those in the
$d_i$-d index and hence a better estimate of the distances in the original $D$-d space. For example, for a range query,
the range condition ($\mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i)) \le p$) is more optimistic (i.e. is satisfied by fewer
objects) than the range condition ($\mathcal{D}(Image(P, S_i), Image(Q, S_i)) \le p$), leading to fewer false positives. The
same is true for k-NN queries. Fewer false positives imply lower query cost. At the same time, adding a new
dimension also increases the cost of the query. Our experiments show that the decrease in the query cost from
fewer false positives offsets the increase in cost from adding a dimension, reducing the overall cost of the query
significantly (cf. Section 5, Figure 12).
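
A small numeric check makes the ordering of the three distances concrete. The setting is a toy assumption, not taken from the paper: $D = 3$, Euclidean distance, principal components equal to the coordinate axes, centroid at the origin and $d_i = 1$, so the reconstruction distance is the norm of the last two coordinates. All names are illustrative.

```python
import math


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def image(p):
    # d_i = 1: keep only the coordinate along the first principal component.
    return [p[0]]


def recon_dist(p):
    # Norm of the eliminated coordinates (toy setting described above).
    return math.hypot(p[1], p[2])


def new_image(p):
    return image(p) + [recon_dist(p)]


P = (1.0, 0.1, 0.0)   # cluster member: small reconstruction distance
Q = (0.0, 3.0, 4.0)   # query point: large reconstruction distance

d_image = l2(image(P), image(Q))          # distance in the d_i-d subspace
d_new = l2(new_image(P), new_image(Q))    # distance in the (d_i+1)-d space
d_orig = l2(P, Q)                         # distance in the original space
print(d_image, d_new, d_orig)
```

In this example the $(d_i+1)$-d distance lies between the $d_i$-d distance and the original distance and is much closer to the latter, which is the effect the extra dimension is meant to achieve.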

4.2  Query  Processing  over  the  Global  Index 

In  this  section,  we  discuss  how  to  execute  similarity  queries  efficiently  using  the  index  structure  described  above 
(cf.  Figure  5).  We  describe  the  query  processing  algorithm  for  point,  range  and  k-NN  queries.  For  correctness, 
the  query  processing  algorithm  must  guarantee  that  it  always  returns  exactly  the  same  answer  as  the  query  on  the 
original space [17, 16]. Often dimensionality reduction techniques do not satisfy the correctness criteria [27, 42].
We show that all our query processing algorithms satisfy the above criteria.

4.2.1  Point  Search 

To find an object $O$, we first find the cluster that contains $O$. It is the first cluster $S$ (in the order mentioned
in Step FC6) for which the reconstruction distance bound is satisfied. If such a cluster $S$ exists, we compute
$NewImage(O, S)$ and find it in the corresponding index by invoking the point search algorithm of the index
structure. The point search returns the object if it exists in the cluster; otherwise it returns null. If no such cluster
$S$ exists, the object must be, if at all, in the outlier set. So we sequentially search through the outlier set and
return the object if it is found there.
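
The cluster selection rule used by point search can be sketched as below. The sketch is deliberately simplified: clusters are represented by 1-d centers, `toy_recon` is a hypothetical stand-in for the reconstruction distance, and set membership replaces the point search of the underlying index structure.

```python
def find_cluster(obj, clusters, max_recon_dist, recon_dist):
    """Index of the first cluster (in FC6 order) that covers obj, else None."""
    for i, cluster in enumerate(clusters):
        if recon_dist(obj, cluster) <= max_recon_dist:
            return i
    return None


def point_search(obj, clusters, members, outliers, max_recon_dist, recon_dist):
    i = find_cluster(obj, clusters, max_recon_dist, recon_dist)
    if i is None:
        # No cluster covers obj: it can only be in the outlier set.
        return obj if obj in outliers else None
    # Stand-in for the point search of the cluster index I_i.
    return obj if obj in members[i] else None


# Toy data: two clusters described by their centers, one outlier.
centers = [0.0, 10.0]
members = [{0.5, 1.5}, {9.0}]
outliers = [50.0]


def toy_recon(x, center):
    return abs(x - center)


found = point_search(9.0, centers, members, outliers, 2.0, toy_recon)
missing = point_search(1.0, centers, members, outliers, 2.0, toy_recon)
```

Because insertion uses the same first-covering-cluster rule, only one cluster index ever has to be probed for a given object.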

4.2.2  Range  Queries 

A range query $Q = (Q, p, \mathcal{D})$ retrieves all objects $O$ in the database that satisfy the range condition $\mathcal{D}(Q, O) \le p$.
The algorithm proceeds as follows (see Appendix C for pseudocode). For each cluster $S_i$, we map the query
anchor $Q$ to its $(d_i+1)$-d image $Q_i$ (using the principal components $\Phi_i$ and subspace dimensionality $d_i$ stored in
the root node $R$ of GI) and execute a range query (with the same range $p$) on the corresponding cluster index $I_i$
by invoking the procedure RangeSearchOnClusterIndex on the root node $R_i$ of $I_i$. RangeSearchOnClusterIndex
is the standard R-tree-style recursive range search procedure that starts from the root node and explores the tree
in a depth-first fashion. It examines the current node $T$: if $T$ is a non-leaf node, it recursively searches each child
node $N$ of $T$ that satisfies the condition $MINDIST(Q, N, \mathcal{D}) \le p$ (where $MINDIST(Q, N, \mathcal{D})$ denotes
the minimum distance of the $(d_i+1)$-d image of the query point to the $(d_i+1)$-d bounding rectangle of $N$ based
on distance function $\mathcal{D}$ [26, 39]); if $T$ is a leaf node, it retrieves each data item $O$ stored in $T$ (which is the
$NewImage$ of the original $D$-d object) that satisfies the range condition $\mathcal{D}(Q, O) \le p$ in the $(d_i+1)$-d space,
accesses the full $D$-dimensional tuple on disk to determine whether it is a false positive and adds it to the result
set if it is not a false positive (i.e. it also satisfies the range condition $\mathcal{D}(Q, O) \le p$ in the original $D$-d space).
After all the cluster indices are searched, we add all the qualifying points from among the outliers to the result
by performing a sequential scan on the outlier set. Since the distance in the index space lower bounds the distance
in the original space (cf. Lemma 1), the above algorithm cannot have any false dismissals. The algorithm cannot have
any false positives either, as they are filtered out before being added to the result set. The above algorithm thus returns
exactly the same answer as the query on the original space.
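
The filter-and-refine structure of the range query can be sketched without any tree traversal. In this simplified sketch each cluster is a flat list of (image, original) pairs, Euclidean distance is assumed, and projecting onto the first coordinate serves as a hypothetical lower-bounding stand-in for $NewImage$; none of these names come from the paper.

```python
import math


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def range_query(query, radius, clusters, outliers):
    """clusters: list of (to_image, entries); entries: (image, original) pairs."""
    result = []
    for to_image, entries in clusters:
        q_img = to_image(query)               # map the query once per cluster
        for img, original in entries:
            if l2(q_img, img) <= radius:      # filter in the reduced space
                # Refine on the full tuple to eliminate false positives.
                if l2(query, original) <= radius:
                    result.append(original)
    # The outlier set is not indexed: sequential scan.
    result.extend(o for o in outliers if l2(query, o) <= radius)
    return result


def first_coord(p):
    return (p[0],)


points = [(0.0, 0.0), (1.0, 5.0), (3.0, 0.0)]
clusters = [(first_coord, [(first_coord(p), p) for p in points])]
outliers = [(0.5, 0.5), (9.0, 9.0)]
hits = range_query((0.0, 0.0), 2.0, clusters, outliers)
```

Here the point (1.0, 5.0) passes the filter but is rejected in the refinement step, which is exactly the kind of false positive that costs an extra page access.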

In the above discussion, we assumed that we store the reduced representation of the points (i.e. the $NewImage$s)
in the leaf pages of the cluster indices. Another option is to store the original $D$-d point in the leaf pages
(although the index is built on the reduced space). With the former option, the index will have far fewer leaf
nodes than the latter due to the smaller representation. On the other hand, in the latter case, the false positives
can be eliminated at the leaf page level while the former would require an additional page access into the relation
(where the full tuple is stored) to eliminate false positives. Since the index is usually a secondary index, we
assume that for each match, we need to access the full tuple anyway (to retrieve the additional attributes). In that
case, the extra cost of the former option is that of additional page accesses for only the false positives (see Section
5.1 for the details on the cost computations). Our experiments show that our technique usually operates in a high
precision zone (> 90%), i.e. has very few false positives. The experiments also show that the smaller size of the
indices in the former approach saves enough query cost to compensate for the few extra I/Os due to false positives.
Hence we store just the $NewImage$s in the leaf pages of the index structure.

4.2.3  k  Nearest  Neighbor  Queries 

A k-NN query $Q = (Q, k, \mathcal{D})$ retrieves a set $\mathcal{R}$ of $k$ objects such that for any two objects $O \in \mathcal{R}$, $O' \notin \mathcal{R}$,
$\mathcal{D}(Q, O) \le \mathcal{D}(Q, O')$. The algorithm for k-NN queries is shown in Table 4. Like the basic k-NN algorithm
[26, 39], the algorithm uses a priority queue queue to navigate the nodes/objects in the database in increasing
order of their distances from $Q$. Note that we use a single queue to navigate the entire global index, i.e. we
explore the nodes/objects of all the cluster indices in an intermixed fashion and do not require separate queues
to navigate the different clusters. Each entry in queue is either a node or an object and stores 3 fields: the id
of the node/object $T$ it corresponds to, the cluster $S$ it belongs to and its distance dist from the query anchor
$Q$. The items (i.e. nodes/objects) are prioritized based on dist, i.e. the item with the smallest dist appears at the
top of the queue (min-priority queue). For nodes, the distance is defined by $MINDIST$ while for objects, it is the
point-to-point distance [26, 39]. Initially, for each cluster, we map the query anchor $Q$ to its $(d_i+1)$-d image
$Q_i$ using the information stored in the root node $R$ of GI (Line 2). Then, for each cluster index $I_i$, we compute
the distance $MINDIST(Q_i, R_i, \mathcal{D})$ of $Q_i$ from the root node $R_i$ of $I_i$ and push $R_i$ into queue along with the
distance and the id of the cluster $S_i$ to which it belongs (Line 3). We also fill the set temp with the $k$ closest
neighbors of $Q$ among the outliers by sequentially scanning through the outlier set (Line 4).

After these initialization steps, we start navigating the index by popping the item from the top of queue at
each step (Line 12). If the popped item is an object, we compute the distance of the original $D$-d object (by
accessing the full tuple on disk) from $Q$ and append it to temp (Lines 13-15). If it is a node, we compute the
distance of each of its children to the appropriate query image $Q_{top.S}$ (where top.S denotes the cluster which top
belongs to) and push them into the queue (Lines 16-21). Note that the image for each cluster is computed just
once (in Line 2) and is reused here. We move an object $O$ from temp to result only when we are sure that it is
among the $k$ nearest neighbors of $Q$, i.e. there exists no object $O' \notin result$ such that $\mathcal{D}(O', Q) < \mathcal{D}(O, Q)$ and
$|result| < k$. The second condition is ensured by the exit condition in Line 11. The condition $O.dist \le top.dist$
in Line 7 ensures that there exists no unexplored object $O'$ such that $\mathcal{D}(O', Q) < \mathcal{D}(O, Q)$. The proof is simple:
$O.dist \le top.dist$ implies $O.dist \le \mathcal{D}(NewImage(O', S), NewImage(Q, S))$ for any unexplored object
$O'$ in a cluster $S$ (by the property of the min-priority queue), which in turn implies $\mathcal{D}(O, Q) \le \mathcal{D}(O', Q)$ (since
$\mathcal{D}(NewImage(O', S), NewImage(Q, S))$ lower bounds $\mathcal{D}(O', Q)$, see Lemma 1). By inserting the objects in
temp (i.e. already explored items) into result in increasing order of their distances in the original $D$-d space (by
keeping temp sorted), we also ensure there exists no explored object $O'$ such that $\mathcal{D}(O', Q) < \mathcal{D}(O, Q)$. This
shows that the algorithm returns the correct answer, i.e. the exact set of objects as the query in the original $D$-d
space. It is also easy to show that the algorithm is I/O optimal.

k-NNSearch(Query $Q = (Q, k, \mathcal{D})$)
 1  for (i = 1; i <= K; i++)
 2      $Q_i$ <- NewImage($Q$, $S_i$);
 3      queue.push($S_i$, $R_i$, MINDIST($Q_i$, $R_i$, $\mathcal{D}$));
 4  Add to temp the k closest neighbors of $Q$ among the outliers (using linear scan)
 5  while (not queue.IsEmpty())
 6      top = queue.Top();
 7      for each object O in temp such that O.dist <= top.dist
 8          temp = temp - O;
 9          result = result U O;
10          retrieved++;
11          if (retrieved = k) return result;
12      queue.Pop();
13      if top.T is an object
14          top.dist = $\mathcal{D}$($Q$, Original(top.T, top.S));
15          temp = temp U top.T;
16      else if top.T is a leaf node
17          for each object O in top.T
18              queue.push(top.S, O, $\mathcal{D}$($Q_{top.S}$, O));
19      else /* top.T is an index node */
20          for each child N of top.T
21              queue.push(top.S, N, MINDIST($Q_{top.S}$, N, $\mathcal{D}$));

Table 4: k-NN Query.

Lemma 2 (Optimality of k-NN algorithm) The k-NN algorithm is optimal, i.e. it does not explore any object
outside the range of the kth nearest neighbor. (Proof in Appendix C).

4.3  Modifications 

We assume that the data is static in order to build the index. However, we must support subsequent
insertions/deletions of the objects to/from the index efficiently. To insert an object $O$, we find the first cluster $S$ (in
the order mentioned earlier) for which the reconstruction distance bound is satisfied, i.e.
$ReconDist(O, S, \mathcal{D}) \le MaxReconDist$. If such a cluster exists, we compute $NewImage(O, S)$ and insert it into the
corresponding index using the insertion algorithm of the index structure. Otherwise, we append $O$ to the outlier set.

The deletion algorithm is also simple. To delete an object $O$, we first find $O$ by invoking the point search
algorithm (cf. Section 4.2.1). If it is found in a cluster, we delete it using the deletion algorithm of the index
structure; else if it is found in the outlier set, we delete it from the outlier set; else, we return not found.

If  the  database  is  dynamic  (i.e.  frequent  insertions  and  deletions),  the  principal  components  need  to  be  updated 
from  time  to  time.  One  option  is  to  repeat  the  entire  clustering  algorithm  and  construct  the  index  structure  from 
scratch. This can be done more efficiently using techniques proposed by Ravi Kanth et al. [27]. The idea is to
use  aggregate  data,  obtained  from  the  cluster  indices,  to  recompute  the  principal  components  for  each  cluster  and 
then  incorporate  the  new  components  back  into  the  cluster  indices.  [27]  shows  that  this  technique  improves  the 
recomputation  time  significantly  without  degrading  the  quality  of  the  index  structure.  We  can  use  their  approach 
to  handle  dynamic  databases.  On  the  other  hand,  if  the  database  is  more  or  less  static  (i.e.  insertions  and  deletions 
are  rare)  as  is  often  the  case  [17,  15],  such  recomputations  are  not  necessary. 

5  Experiments 

In  this  section,  we  present  the  results  of  an  extensive  empirical  study  we  have  conducted  to  (1)  evaluate  the 
effectiveness  of  LDR  as  a  high  dimensional  indexing  technique  and  (2)  compare  it  with  other  techniques,  namely, 
GDR,  original  space  indexing  (OSI)  and  linear  scan.  We  conducted  our  experiments  on  both  synthetic  and  real- 
life  datasets.  The  major  findings  of  our  study  can  be  summarized  as  follows: 

•  High  Precision:  LDR  provides  up  to  an  order  of  magnitude  improvement  in  precision  over  the  GDR 
technique  at  the  same  reduced  dimensionality.  This  indicates  that  LDR  can  achieve  the  same  reduction  as 
GDR  with  significantly  lower  loss  of  distance  information. 

•  Low  Query  Cost:  LDR  consistently  outperforms  other  indexing  techniques,  namely  GDR,  original  space 
indexing  and  sequential  scan,  in  terms  of  query  cost  (combined  I/O  and  CPU  costs)  for  both  synthetic  and 
real-life  datasets. 

Thus,  our  experimental  results  validate  the  thesis  of  this  paper  that  LDR  is  an  effective  indexing  technique  for 
high  dimensional  datasets.  All  experiments  reported  in  this  section  were  conducted  on  a  Sun  Ultra  Enterprise 
450  machine  with  1  GB  of  physical  memory  and  several  GB  of  secondary  storage,  running  Solaris  2.5. 

5.1  Experimental  Methodology 

We  conduct  the  following  two  sets  of  experiments  to  evaluate  the  LDR  technique  and  compare  it  with  other 
indexing  techniques. 

Precision Experiments  Due to dimensionality reduction, both GDR and LDR cause loss of distance information (e.g., in Figure 15 in Appendix A, the distance between D and E is lost due to elimination of the second principal component). The more dimensions are eliminated, the more information is lost. We measure this loss by precision, defined as

$Precision = \frac{|R_{original}|}{|R_{reduced}|}$,

where $R_{reduced}$ and $R_{original}$ are the sets of answers returned by the range query on the reduced dimensional space and the original HDFS respectively [27]. For k-NN queries, $R_{original}$ is the set of k actual answers while $R_{reduced}$ is the set of objects we need to explore before being sure that we have seen all the k actual answers. Note that the set $(R_{reduced} - R_{original})$ represents the false positives; so

$Precision = \frac{|R_{original}|}{|R_{original}| + |false\_positives|}$.

We repeat that since our algorithms guarantee that the user always gets back the correct set $R_{original}$ of answers (as if the query were executed in the original HDFS), precision does not measure the quality of the answers returned to the user but just the information loss incurred by the DR technique and hence the query cost. For a DR technique, if we fix the reduced dimensionality, the higher the precision, the lower the cost of the query and the more efficient the technique. We compare the GDR and LDR techniques based on precision at fixed reduced dimensionalities.
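
As a small worked example of the precision measure, the sketch below computes it for one query from two answer sets. The function name and the sample sets are illustrative assumptions; the only property taken from the text is that the reduced-space answer set contains every true answer.

```python
def precision(original_answers, reduced_answers):
    """Return |R_original| / |R_reduced| for a single query."""
    original = set(original_answers)
    reduced = set(reduced_answers)
    if not original <= reduced:
        raise ValueError("the reduced-space answer set must contain every true answer")
    if not reduced:
        return 1.0  # assumption: an empty query result loses no information
    return len(original) / len(reduced)


true_answers = {1, 2, 3, 4}
candidates = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
false_positives = candidates - true_answers

print(precision(true_answers, candidates))  # 0.4
# The same value expressed through the number of false positives.
print(len(true_answers) / (len(true_answers) + len(false_positives)))  # 0.4
```

Here six of the ten candidates retrieved in the reduced space are false positives, so only 40% of the retrieved objects are actual answers.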

Cost  Experiments  We  conducted  experiments  to  measure  the  query  cost  (I/O  and  CPU  costs)  for  each  of  the 
following  four  indexing  techniques.  We  describe  how  we  compute  the  I/O  and  CPU  costs  of  the  techniques 
below. 

•  Linear Scan: In this technique, we perform a simple linear scan on the original high dimensional dataset. The I/O cost in terms of sequential disk accesses is $\frac{N * (D * sizeof(float) + sizeof(id))}{PageSize}$. Since $sizeof(id) \ll (D * sizeof(float))$, we will ignore the $sizeof(id)$ henceforth. Assuming sequential I/O is 10 times faster than random I/O, the cost in terms of random accesses is $\frac{N * D * sizeof(float)}{10 * PageSize}$. The CPU cost is the cost of computing the distance of the query from each point in the database.

•  Original  Space  Indexing  ( OSI):  In  this  technique,  we  build  the  index  on  the  original  HDFS  itself  using  a 
multidimensional  index  structure.  We  use  the  hybrid  tree  as  the  index  structure.  The  I/O  cost  (in  terms  of 
random  disk  accesses)  of  the  query  is  the  number  of  nodes  of  the  index  structure  accessed.  The  CPU  cost 
is  the  CPU  time  (excluding  I/O  wait)  required  to  navigate  the  index  and  return  the  answers. 

•  GDR: In this technique, we perform PCA on the original dataset, retain the first few principal components (depending on the desired reduced dimensionality) and index the reduced dimensional space using the hybrid tree index structure. In this case, the I/O cost has 2 components: index page accesses (discussed in OSI) and accessing the full tuples in the relation for false positive elimination (post processing cost). The post processing cost can be one I/O per false positive in the worst case. However, as observed in [23], this assumption is overly pessimistic (and this is confirmed by our experiments). We, therefore, assume the postprocessing I/O cost to be $\frac{num\_false\_positives}{2}$. The total I/O cost (in number of random disk accesses) is $index\_page\_access\_cost + \frac{num\_false\_positives}{2}$. The CPU cost is the sum of the index CPU cost and the post processing CPU cost, i.e. the cost of computing the distance of the query from each of the false positives.

•  LDR: In this technique, we index each cluster using the hybrid tree multidimensional index structure and use a linear scan for the outlier set. For LDR, the I/O cost of a query has 3 components: index page accesses for each cluster index, linear scan on the outlier set and accessing the full tuples in the relation (post processing cost). The total index page access cost is the total number of nodes accessed of all the cluster indices combined. The number of sequential disk accesses for the outlier scan is $\frac{|\mathcal{O}| * D * sizeof(float)}{PageSize}$. The cost of the outlier scan in terms of random accesses is $\frac{|\mathcal{O}| * D * sizeof(float)}{10 * PageSize}$. The postprocessing I/O cost is $\frac{num\_false\_positives}{2}$ (as discussed above). The total I/O cost (in number of random disk accesses) is $index\_page\_access\_cost + \frac{|\mathcal{O}| * D * sizeof(float)}{10 * PageSize} + \frac{num\_false\_positives}{2}$. Similarly, the CPU cost is the sum of the index CPU cost, outlier scan CPU cost (i.e. cost of computing the distance of the query from each of the outliers) and the post processing cost (i.e. cost of computing the distance of the query from each of the false positives).
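
The I/O cost formulas for the techniques above can be put into a few lines of code. This is a sketch under stated assumptions: the function and parameter names are illustrative, `sizeof(id)` is ignored throughout, the sequential-to-random ratio of 10 and the 4KB page come from the text, and the post-processing term is taken as half the number of false positives, which is how we read the formula in the GDR item.

```python
def scan_io_cost(num_points, dims, page_size=4096, float_size=4, seq_speedup=10):
    """I/O cost of a sequential scan, expressed in random disk accesses."""
    sequential_accesses = num_points * dims * float_size / page_size
    return sequential_accesses / seq_speedup


def gdr_io_cost(index_page_accesses, num_false_positives):
    """Index page accesses plus post-processing accesses for false positives."""
    return index_page_accesses + num_false_positives / 2


def ldr_io_cost(index_page_accesses, num_outliers, dims, num_false_positives):
    """Cluster index accesses, outlier scan and post-processing combined."""
    outlier_scan = scan_io_cost(num_outliers, dims)
    return index_page_accesses + outlier_scan + num_false_positives / 2


# Linear scan over 100,000 points in 64 dimensions (hypothetical inputs).
print(scan_io_cost(100_000, 64))  # 625.0

# LDR with made-up counts: 100 index pages, 5,000 outliers, 50 false positives.
print(ldr_io_cost(100, 5_000, 64, 50))  # 156.25
```

For 100,000 points in 64 dimensions the model gives 625 random accesses, which is close to the roughly 627 quoted for linear scan in the synthetic-data experiments; the formula ignores object identifiers and any page-utilization effects.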

We chose the hybrid tree as the index structure for our experiments since it is a space partitioning index structure (“dimensionality-independent” fanout) and has been shown to scale to high dimensionalities [7, 38, 9].⁵ We use a page size of 4KB for all our experiments.

⁵ The performance gap between our technique and the other techniques was even greater with SR-tree [28] as the index structure due to higher dimensionality curse [7]. We do not report those results here; they can be found in the full version of the paper [8].

Figure 6: Sensitivity of precision to skew.

Figure 7: Sensitivity of precision to number of clusters.

Figure 8: Sensitivity of precision to degree of correlation.

5.2  Experimental  Results  -  Synthetic  Data  Sets 

Synthetic  Data  Sets  and  Queries  In  order  to  generate  the  synthetic  data,  we  use  a  method  similar  to  that 
discussed  in  [45]  but  appropriately  modified  so  that  we  can  generate  the  different  clusters  in  subspaces  of  dif¬ 
ferent  orientations  and  dimensionalities.  The  synthetic  dataset  generator  is  described  in  Appendix  F.  The  input 
parameters to the data generator and their default values are shown in Table 6 (Appendix F).

We  generated  100  range  queries  by  selecting  their  query  anchors  randomly  from  the  dataset  and  choosing  a 
range  value  such  that  the  average  query  selectivity  is  about  2%.  We  tested  with  only  range  queries  since  the 
k-NN algorithm, being optimal, is identical to the range query with the range equal to the distance of the k-th
nearest neighbor from the query (Lemma 3). We use $L_2$ distance (Euclidean) as the distance metric. All our
measurements are averaged over the 100 queries.

Precision Experiments  In our first set of experiments, we carry out a sensitivity analysis of the GDR and LDR techniques to parameters like skew in the size of the clusters ($z_{size}$), number of clusters (k) and degree of correlation (p). In each experiment, we vary the parameter of interest while the remaining parameters are fixed at their default values. We fix the reduced dimensionality of the GDR technique to 15. We fix the average subspace dimensionality of the clusters also to 15 by choosing FracOutliers and MaxReconDist appropriately (FracOutliers = 0.1 and MaxReconDist = 0.5). Figure 6 compares the precision of the LDR technique with that of GDR for various values of $z_{size}$. LDR achieves about 3 times higher precision compared to GDR, i.e. the latter has more than three times the number of false positives as the former. The precision of neither technique changes significantly with the skew. Figure 7 compares the precision of the two techniques for various values of k. As expected, for one cluster, the two techniques are identical. As k increases, the precision of GDR deteriorates while that of LDR is independent of the number of clusters. For k = 10, LDR is almost an order of magnitude better than GDR in terms of precision. Figure 8 compares the two techniques for various values of p. As the degree of correlation decreases (i.e. the value of p increases), the precision of both techniques drops but LDR outperforms GDR for all values of p. Figure 9 shows the variation of the precision with the reduced dimensionality. For the GDR technique, we vary the reduced dimensionality from 15 to 60. For the LDR technique, we vary FracOutliers from 0.2 to 0.01 (0.2, 0.15, 0.1, 0.05, 0.02, 0.01), causing the average subspace dimensionality to vary from 7 to 42 (7, 10, 12, 14, 23 and 42) (MaxDim was 64). The precision of both techniques increases with the increase in reduced dimensionality. Once again, LDR consistently outperforms GDR at all dimensionalities. The above experiments show that LDR is a more effective dimensionality reduction technique as it can achieve the same reduction as GDR with significantly lower loss of information (i.e. high precision) and hence significantly lower cost, as confirmed in the cost experiments described next.

Figure 9: Sensitivity of precision to reduced dimensionality.

Figure 10: Comparison of LDR, GDR, Original Space Indexing and Linear Scan in terms of I/O cost. For linear scan, the cost is computed as $\frac{num\_sequential\_disk\_accesses}{10}$.

Figure 11: Comparison of LDR, GDR, Original Space Indexing and Linear Scan in terms of CPU cost.

Cost Experiments  We compare the 4 techniques, namely LDR, GDR, OSI and Linear Scan, in terms of query cost for the synthetic dataset. Figure 10 compares the I/O cost of the 4 techniques. Both the LDR and GDR techniques have U-shaped cost curves: when the reduced dimensionality is too low, there is a high degree of information loss leading to a large number of false positives and hence a high post-processing cost; when it is too high, the index page access cost becomes too high due to the dimensionality curse. The optimum point lies somewhere in the middle: it is at dimensionality 14 (about 250 random disk accesses) for LDR and at 40 (about 1200 random disk accesses) for GDR. The I/O cost of OSI and Linear Scan is independent of the reduced dimensionality. LDR significantly outperforms all the other 3 techniques in terms of I/O cost. The only technique that comes close to LDR in terms of I/O cost is the linear scan (but LDR is 2.5 times better as the latter performs 6274 sequential accesses ≈ 627 random accesses). However, linear scan loses out mainly due to its high CPU cost shown in Figure 11. While the LDR, GDR and OSI techniques have similar CPU costs (at their respective optimum points), the CPU cost of linear scan is almost two orders of magnitude higher than the rest. LDR has a slightly higher CPU cost compared to GDR and OSI since it uses linear scan for the outlier set; however, the savings in I/O cost over GDR and OSI (by a factor of 5-6) far outweigh the slightly higher CPU cost.

5.3  Experimental  Results  -  Real-Life  Data  Sets 

Description of Dataset  Our real-life data set (COLHIST dataset [7]) comprises 8 × 8 color histograms (64-d
data)  extracted  from  about  70,000  color  images  obtained  from  the  Corel  Database  (http://corel.digitalriver.com/) 
and  is  available  online  at  the  UCI  KDD  Archive  web  site  (http://kdd.ics.uci.edu/databases/CorelFeatures).  We 
generated  100  range  queries  by  selecting  their  query  anchors  randomly  from  the  dataset  and  choosing  a  range 
value  such  that  the  average  query  selectivity  is  about  0.5%.  All  our  measurements  are  averaged  over  the  100 
queries. 

Cost Experiments  First, we evaluate the impact of adding ReconDist as an additional dimension of each cluster in the LDR technique. Figure 12 shows that the additional dimension reduces the cost of the query significantly. We performed the above experiment on the synthetic dataset as well and observed a similar result. Figure 13 compares the 4 techniques, namely LDR, GDR, OSI and Linear Scan, in terms of I/O cost. LDR outperforms all other techniques significantly. Again, the only technique that comes close to LDR in I/O cost (i.e. number of random disk accesses) is the linear scan. However, again, linear scan turns out to be significantly worse than LDR in terms of the overall cost due to its high CPU cost, as shown in Figure 14.

Figure 12: Effect of adding the extra dimension.

Figure 13: Comparison of LDR, GDR, Original Space Indexing and Linear Scan in terms of I/O cost. For linear scan, the cost is computed as $\frac{num\_sequential\_disk\_accesses}{10}$.

Figure 14: Comparison of LDR, GDR, Original Space Indexing and Linear Scan in terms of CPU cost.


6  Conclusion 

With numerous emerging applications requiring efficient access to high dimensional datasets, there is a need for scalable techniques for indexing high dimensional data. In this paper, we proposed local dimensionality reduction (LDR) as an approach to indexing high dimensional spaces. We developed an algorithm to discover the locally correlated clusters in the dataset and perform dimensionality reduction on each of them individually. We presented an index structure that exploits the correlated clusters to efficiently support similarity queries over high dimensional datasets. We have shown that our query processing algorithms are correct and optimal. We conducted an extensive experimental study with synthetic as well as real-life datasets to evaluate the effectiveness of our technique and compare it to GDR, original space indexing and linear scan techniques. Our results demonstrate that our technique (1) reduces the dimensionality of the data with significantly lower loss in distance information compared to GDR, outperforming GDR by almost an order of magnitude in terms of query precision (for the same reduced dimensionality) and (2) significantly outperforms all the other 3 techniques (namely, GDR, original space indexing and linear scan) in terms of the query cost for both synthetic and real-life datasets.

The material in this appendix can be read at the discretion of the reviewer and has been included only for the purpose of completeness.

A  Principal  Component  Analysis 

PCA  examines  the  variance  structure  in  the  data  and  determines  the  directions  along  which  the  data  exhibits 
high  variance.  The  first  principal  component  accounts  for  as  much  of  the  variability  in  the  data  as  possible,  and 
each  succeeding  component  accounts  for  as  much  of  the  remaining  variability  as  possible.  Figure  15  shows  a  set 
of  points  and  the  two  principal  components.  Since  the  first  few  principal  components  account  for  most  of  the 
variation  in  the  data,  the  rest  can  be  eliminated  without  significant  loss  of  information.  For  example,  in  Figure 
15,  the  second  principal  component  can  be  eliminated,  thus  reducing  the  dimensionality  from  2  to  1.  The  1-d 
images of the 2-d points are obtained by projecting them on the first principal component (shown by squares in
Figure 15). The reduced dimensional points are then indexed using an index structure.

[Figure 15 (plot not reproduced): points in 2-d space and their 1-d images after dimensionality reduction. The first principal component, i.e. the direction of maximum variance, is the retained dimension; the second principal component is the eliminated dimension.]

Figure 15: Global Dimensional Reduction (PCA or K-L Transform technique) where D=2, d=1.

We now describe how the principal components are computed algebraically. Let A be the N × D data matrix in which each row corresponds to a point in the original D-dimensional space. The first principal component is the eigenvector corresponding to the largest eigenvalue of the variance-covariance matrix of A, the second component corresponds to the eigenvector with the second largest eigenvalue and so on. The mapping (to reduced dimensionality) corresponds to the well known Singular Value Decomposition (SVD) of the data matrix A and can be done in $O(ND^2)$ time.
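
The algebraic description above translates almost directly into NumPy. The following is a minimal sketch, assuming an N × D array with D ≥ 2; the function names are illustrative, and an eigendecomposition of the covariance matrix is used rather than an explicit SVD.

```python
import numpy as np


def principal_components(data):
    """Eigenvalues (descending) and matching eigenvectors of the covariance matrix."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)  # D x D variance-covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


def reduce_dimensionality(data, d):
    """Project the centered points onto the first d principal components."""
    data = np.asarray(data, dtype=float)
    _, components = principal_components(data)
    centered = data - data.mean(axis=0)
    return centered @ components[:, :d]


# Four collinear 2-d points: one component captures all of the variance.
points = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(reduce_dimensionality(points, 1).shape)  # (4, 1)
```

Because the sample points lie on a line, dropping the second component loses no distance information; in general the eliminated components determine how much is lost.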

B  Lower  Bounding  Lemma 

Lemma 3 (Lower Bounding Lemma) $\mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i))$ always lower bounds $\mathcal{D}(P, Q)$.

Proof: Let $P_i$ denote $Image(P, S_i)$ and $Q_i$ denote $Image(Q, S_i)$. Let $P'$ and $Q'$ denote $P$ and $Q$ expressed in the coordinate system of the principal components $\Phi_i$ of $S_i$. Then, $\mathcal{D}(P', Q') = \mathcal{D}(P, Q)$ since $\Phi_i$ is orthonormal. Now,

$P' = P_i + ReconVect(P, S_i) + \sum_{j=d_i+1}^{D} c_j$    (10)

$Q' = Q_i + ReconVect(Q, S_i) + \sum_{j=d_i+1}^{D} c_j$    (11)

The vector distance $Dist(P', Q')$ between $P'$ and $Q'$ is

$Dist(P', Q') = Dist(P_i, Q_i) + (ReconVect(P, S_i) - ReconVect(Q, S_i))$    (12)

$\Rightarrow \mathcal{D}(P', Q') = [\mathcal{D}(P_i, Q_i)^p + \| ReconVect(P, S_i) - ReconVect(Q, S_i) \|_p^p]^{1/p}$    (13)

Since $L_p$ functions obey the triangle inequality,

$\| ReconVect(P, S_i) - ReconVect(Q, S_i) \|_p \ge | ReconDist(P, S_i, \mathcal{D}) - ReconDist(Q, S_i, \mathcal{D}) |$    (14)

$\Rightarrow \mathcal{D}(P', Q') \ge [\mathcal{D}(P_i, Q_i)^p + | ReconDist(P, S_i, \mathcal{D}) - ReconDist(Q, S_i, \mathcal{D}) |^p]^{1/p}$    (15)

Now,

$\mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i)) = [\mathcal{D}(P_i, Q_i)^p + | ReconDist(P, S_i, \mathcal{D}) - ReconDist(Q, S_i, \mathcal{D}) |^p]^{1/p}$    (16)

Since $\mathcal{D}(P', Q') = \mathcal{D}(P, Q)$ and from Equations 15 and 16,

$\mathcal{D}(P, Q) \ge \mathcal{D}(NewImage(P, S_i), NewImage(Q, S_i))$    (17)
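
The lemma can also be checked numerically. The sketch below does so for the $L_2$ metric only, on random points and a random orthonormal basis; the helper `new_image`, the centroid handling and all parameter values are assumptions made for the experiment, not definitions taken from the paper.

```python
import numpy as np


def new_image(point, centroid, components):
    """Projection on the retained components with the reconstruction distance appended."""
    centered = point - centroid
    image = components @ centered
    recon_dist = np.linalg.norm(centered - components.T @ image)
    return np.append(image, recon_dist)


def smallest_gap(num_trials=1000, dims=8, retained=3, seed=0):
    """Smallest observed value of D(P, Q) - D(NewImage(P), NewImage(Q))."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((dims, dims)))  # orthonormal columns
    components = basis[:, :retained].T
    centroid = rng.standard_normal(dims)
    gap = np.inf
    for _ in range(num_trials):
        p = rng.standard_normal(dims)
        q = rng.standard_normal(dims)
        reduced = np.linalg.norm(
            new_image(p, centroid, components) - new_image(q, centroid, components)
        )
        gap = min(gap, np.linalg.norm(p - q) - reduced)
    return gap


# A non-negative gap (up to rounding) means no pair violated the lower bound.
print(smallest_gap() >= -1e-9)  # True
```

A check of this kind cannot replace the proof, but it is a cheap guard against implementation mistakes in the image computation.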


C  Range  Query  Algorithm 


RangeSearch(Query Q = ⟨Q, ρ, D⟩)
  1   for (i = 1; i ≤ K; i++)
  2       Q_i ← NewImage(Q, S_i);
  3       Q_i ← ⟨Q_i, ρ, D⟩;
  4       RangeSearchOnClusterIndex(Q_i, R_i, S_i, result);
  5   for each O ∈ 𝒪
  6       if D(Q, O) ≤ ρ  result ← result ∪ O;

RangeSearchOnClusterIndex(Query Q, Node T, Cluster S, Set result)
  1   if (T is a non-leaf node)
  2       foreach child N of T
  3           if MINDIST(Q, N, D) ≤ ρ  RangeSearchOnClusterIndex(Q, N, S, result);
  4   else /* T is a leaf node */
  5       for each object O in T
  6           if D(Q, O) ≤ ρ
  7               if D(Original(Q, S), Original(O, S)) ≤ ρ  result ← result ∪ O;

Table 5: Range Query.
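
The filter-and-refine structure of Table 5 can be illustrated with a simplified sketch in which each cluster index is replaced by a plain scan over the cluster's points. This is an assumption made to keep the example short: there is no tree traversal or MINDIST pruning here, the $L_2$ metric is assumed, and the tuple layout of `clusters` is hypothetical.

```python
import numpy as np


def new_image(point, centroid, components):
    """Projection on the retained components with the reconstruction distance appended."""
    centered = point - centroid
    image = components @ centered
    recon_dist = np.linalg.norm(centered - components.T @ image)
    return np.append(image, recon_dist)


def range_search(query, radius, clusters, outliers):
    """clusters: iterable of (centroid, components, points); outliers: array of points."""
    result = []
    for centroid, components, points in clusters:
        query_image = new_image(query, centroid, components)
        for point in points:
            point_image = new_image(point, centroid, components)
            # Filter step: the reduced-space distance lower bounds the true distance.
            if np.linalg.norm(query_image - point_image) <= radius:
                # Refine step: eliminate false positives in the original space.
                if np.linalg.norm(query - point) <= radius:
                    result.append(point)
    # Outliers are not indexed, so they are always scanned sequentially.
    for point in outliers:
        if np.linalg.norm(query - point) <= radius:
            result.append(point)
    return result
```

Because the filter uses a lower bound on the true distance, every true answer survives it, so the returned set should match what a brute-force scan of the original space returns.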


D  Optimality  of  k  nearest  neighbor  algorithm 

Lemma 4 (Optimality of k-NN algorithm) The k-NN algorithm is optimal.

Proof: Let $a = \max_{O \in A} \mathcal{D}(Q, O)$ where $A$ is the set of final answers (the k nearest neighbors). The algorithm is optimal if it does not explore any indexed object $O$ (in any cluster) (Lines 13-15) such that $\mathcal{D}(NewImage(O, S), NewImage(Q, S)) > a$. Let us assume that it does explore such an object $O$. When $O$ is explored, $|result| < k$ because otherwise the algorithm would have terminated before reaching this point. We will show that when $O$ is explored, $|result|$ is at least $k$ and hence prove the lemma (by contradiction). Each $O' \in A$ has been explored before $O$ since $\mathcal{D}(NewImage(O', S), NewImage(Q, S)) \le a < \mathcal{D}(NewImage(O, S), NewImage(Q, S))$ (by the lower bounding lemma and the property of the min-priority queue). Now $top.dist = \mathcal{D}(NewImage(O, S), NewImage(Q, S))$ when $O$ is explored, i.e. $top.dist > a$. Since each $O' \in A$ satisfies the condition $\mathcal{D}(Q, O') \le a$, it satisfies the condition $\mathcal{D}(Q, O') \le top.dist$ and is hence added to result (Line 7). So $|result|$ is at least $k$. ■

E  Analysis  of  the  height  and  balance  of  the  global  index  structure 

Let $h_{GI}$ denote the height of GI. Let $h_{orig}$ denote the height of the original space index, i.e. the index on the entire dataset in the D-d original space. We assume that the multidimensional index structure used as the original space index is the same as the one used to index the clusters (e.g., hybrid tree in both cases). Then, $h_{GI} \le 1 + h_{orig}$. Since $I_i$ is built on a subset of points of the entire set (i.e. $n_i \le N$) and fewer dimensions (i.e. $d_i \le D$), its height $h_{I_i}$ cannot be greater than $h_{orig}$. Since $h_{GI} = 1 + \max_i h_{I_i}$ and $h_{I_i} \le h_{orig}$ for all $i$, $h_{GI} \le 1 + h_{orig}$. The bound is a conservative one as $h_{GI}$ is usually smaller than $h_{orig}$ due to the reduced size of the index.

We now show that GI is almost height-balanced. There are two factors that affect the height of a cluster index $I_i$: the number of points $n_i$ and the subspace dimensionality $d_i$. The lower the value of $n_i$, the lower the height. Also, the lower the value of $d_i$, the lower the height. Let $I_{short}$ be the shortest index. Note $n_{short} \ge MinSize$. Let $C_{short}$ and $F_{short}$ denote the average number of entries in a leaf and index node of $I_{short}$ respectively. Then, as explained in [23], the minimum possible height of $I_{short}$ is $(1 + \lceil \log_{F_{short}}(\lceil \frac{MinSize}{C_{short}} \rceil) \rceil)$. Similarly, the maximum possible height of the tallest index $I_{tall}$ is $(1 + \lceil \log_{F_{tall}}(\lceil \frac{N}{C_{tall}} \rceil) \rceil)$ since $n_{tall} \le N$. For space partitioning index structures (which are preferred for high dimensional indexing due to their “dimensionality-independent” fanout), $F_{short} \approx F_{tall}$ (say, $F$) [7]. $C_{short}$ and $C_{tall}$ depend on the respective subspace dimensionalities, i.e. $\frac{C_{short}}{C_{tall}} \approx \frac{d_{tall}}{d_{short}}$. The maximum difference $l_{max}$ in the lengths of any two paths from R to leaf is $\approx \log_F(\frac{N * C_{short}}{MinSize * C_{tall}})$, i.e. $l_{max} \approx \log_F(\frac{N * d_{tall}}{MinSize * d_{short}})$. Usually, the subspace dimensionalities are close, i.e. $d_{tall} \approx d_{short}$. For space-partitioning indexes, $F$ is typically around 50-100 [7]. Under the above assumptions, $l_{max} \le 1$ if $MinSize \ge \frac{N}{100}$ and $l_{max} \le 2$ if $MinSize < \frac{N}{100}$. In other words, with a proper size bound, $l_{max}$ is usually 1 or at most 2, implying that GI is almost height balanced.

F  Synthetic  Data  Generation 

In order to generate the synthetic data, we use a method similar to that discussed in [45] but appropriately modified so that we can generate the different clusters in subspaces of different orientations and dimensionalities. The input parameters to the data generator are shown in Table 6. The generator generates k clusters with a total of $N \cdot (1 - o)$ points distributed among them using a Zipfian distribution with value $z_{size}$. The subspace dimensionality of each cluster also follows a Zipfian distribution with value $z_{dim}$, the average subspace dimensionality being d. Each cluster is generated as follows. For a cluster with size $n_i$ and subspace dimensionality $d_i$ (computed using the Zipfian distributions described above), we randomly choose $d_i$ dimensions among the D dimensions as the subspace dimensions and generate $n_i$ points in that $d_i$-d plane. Along each of the remaining $(D - d_i)$ non-subspace dimensions, we assign a randomly chosen coordinate to all the $n_i$ points in the cluster. Let $l_j$ be the randomly chosen coordinate along the j-th non-subspace dimension. In the subspace, the points are spatially clustered into several regions (c regions on average) with each region having a randomly chosen centroid and an extent of r from the centroid along each of the $d_i$ dimensions. After all the points in the cluster are generated, each point is displaced by a distance of at most p in either direction along each non-subspace dimension, i.e. the point is randomly placed somewhere between $(l_j - p)$ and $(l_j + p)$ along the j-th non-subspace dimension. The amount of displacement (i.e. value of p) determines the degree of correlation (since r is fixed): the lower the value, the higher the correlation. To make the subspaces arbitrarily oriented, we generate a random orthonormal rotation matrix (generated using MATLAB) and rotate the cluster by multiplying the data matrix with the rotation matrix. After all the clusters are generated, we randomly generate $N \cdot o$ points (with random values along all D dimensions) as the outliers. The default values of the various parameters are shown in Table 6.

Parameter   Description                                                               Default Value
N           Total number of points                                                    100000
D           Original dimensionality                                                   64
k           Number of clusters                                                        5
d           Average subspace dimensionality                                           10
z_dim       Skew in subspace dimensionality across clusters                           0.5
z_size      Skew in size across clusters                                              0.5
c           Number of spatial clusters per cluster                                    10
r           Extent of a spatial cluster from centroid along each subspace dimension   0.5
p           Maximum displacement of points along each non-subspace dimension          0.1
o           Fraction outliers                                                         0.05

Table 6: Input parameters to Synthetic Data Generator
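
The generation of a single cluster, as described in this appendix, might be sketched as follows. Several details are assumptions made for the example: coordinates and region centroids are drawn uniformly from [0, 1], exactly `num_regions` regions are used rather than that number on average, the random rotation comes from a QR factorization in NumPy instead of MATLAB, and the Zipfian choice of cluster sizes and dimensionalities is left out.

```python
import numpy as np


def generate_cluster(n_points, dims, sub_dims, num_regions=10, extent=0.5, p=0.1, rng=None):
    """Return (points, rotation, subspace) for one correlated cluster."""
    rng = np.random.default_rng() if rng is None else rng
    subspace = rng.choice(dims, size=sub_dims, replace=False)
    others = np.setdiff1d(np.arange(dims), subspace)
    points = np.empty((n_points, dims))

    # Inside the subspace: points spread around randomly chosen region centroids.
    centroids = rng.uniform(0.0, 1.0, size=(num_regions, sub_dims))
    region = rng.integers(num_regions, size=n_points)
    spread = rng.uniform(-extent, extent, size=(n_points, sub_dims))
    points[:, subspace] = centroids[region] + spread

    # Outside the subspace: one shared coordinate per dimension, displaced by at most p.
    shared = rng.uniform(0.0, 1.0, size=others.size)
    points[:, others] = shared + rng.uniform(-p, p, size=(n_points, others.size))

    # Arbitrary orientation: multiply by a random orthonormal matrix.
    rotation, _ = np.linalg.qr(rng.standard_normal((dims, dims)))
    return points @ rotation, rotation, subspace


cluster, rotation, subspace = generate_cluster(1000, dims=64, sub_dims=10)
print(cluster.shape)  # (1000, 64)
```

A smaller `p` keeps the points closer to the chosen subspace, which corresponds to a higher degree of correlation in the sense used above.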

G  Sensitivity to MaxReconDist parameter


Figure 16: Sensitivity of I/O cost of LDR technique to MaxReconDist.

Figure 17: Sensitivity of CPU cost of LDR technique to MaxReconDist.

Figures 16 and 17 show the sensitivity of the LDR technique to the MaxReconDist parameter in terms of I/O and CPU costs respectively. The I/O cost improves as MaxReconDist decreases, because the information loss decreases (i.e. fewer false positives) and hence the post processing cost decreases. However, as MaxReconDist decreases, the number of outliers increases since fewer points satisfy the reconstruction distance bound, which causes the CPU cost (the cost of scanning the outlier set) to increase, as shown in Figure 17. The choice of MaxReconDist must consider the combined I/O and CPU cost; for example, MaxReconDist = 0.08 represents a good choice for this real-life dataset.


